course: Data Analytics with Statistics | lecturer: Prof. Dr. Jan Kirenz | Date: 29.12.2023 | Name: Julian Erath, Furkan Saygin, Sofie Pischl | Group: Group B

Draft analysis¶


Group name: Group iBm


Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset¶

1. Introduction¶

Scope of the Project:

The project aims to analyze and derive insights from the ERA5 (ECMWF Reanalysis version 5) weather dataset. It involves both regression and classification analyses to model and predict various weather parameters as well as gain insights on correlations, causations and patterns involving the parameters 1. The ERA5 dataset, provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), is a high-quality global atmospheric reanalysis dataset covering multiple decades 2. The analysis will focus on the region of Bancroft in Ontario, Canada for the years 2015 to 2022.

1.1 Introduction to Subject - Motivation¶

Exploring the Atmosphere Through Data Weather, a phenomenon as old as Earth itself, has long been a subject of human fascination. The complex interplay of temperature, wind, and precipitation shapes our environment, affects our lives, and challenges our understanding of the natural world 3. With the advent of technology and data analysis techniques, a deeper exploration of these complex processes is now possible. 4. The project, titled "Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset", aims to contribute to this ongoing exploration by examining how different variables interact to create complex weather phenomena.

Geographic Focus: Bancroft, Canada While this project leverages the ERA5 dataset, Bancroft, Ontario in Canada is chosen to specifically focus on, for an in-depth analysis. The geographic location of Bancroft can be seen in picture 1. This decision is rooted in the unique climatic and meteorological characteristics of the region, which offer a distinctive case study for understanding complex weather phenomena. The geographical and climatic characteristics of Bancroft (strongly influenced by the so called 'lake-effect') create an excellent case study for analyzing the relationships among different atmospheric elements, which also leads to interest in that area by IBM and IBM clients 5. The work presented is limited to the evaluation of the years 2015 to 2022 in order to include only the current and latest climate developments and to record the current status without falsifying the results due to strongly deviating values from observations further in the past. Focusing on Bancroft enables a targeted application of regression and classification techniques, facilitating a detailed analysis of localized weather patterns and their impacts 6.

Complete Map of Ontario in Canada Picture 1: Complete Map of Ontario in Canada, containing the relevant regions for the analysis (Bancroft in the centered southeast). Copyright: IBM Deutschland GmbH

1.2 Motivation for Research Question¶

The Impetus for Investigation The motivation of this project is rooted in historical interest and present-day necessity. As discussed in a systematic review titled "The contribution of weather forecast information to agriculture, water, and energy sectors in East and West Africa," accurate weather prediction is crucial for effective agriculture, disaster management, and urban planning, particularly in the context of climate change risks7. An example of the complexity in weather systems is the Lake Effect, explained in "A Hybrid Dataset of Historical Cool-Season Lake Effects From the Eastern Great Lakes of North America". The Lake Effect is a phenomenon where cold winds moving over warmer lake waters, leads to significant changes in weather patterns nearby8. The European Centre for Medium-Range Weather Forecasts (ECMWF) and its ERA5 dataset provide an great opportunity to analyze detailed weather parameters over several decades. This data, examined for its comprehensiveness in the "Evaluation of spatial-temporal variation performance of ERA5 precipitation data in China," is a example for researchers seeking to understand weather patterns and atmospheric dynamics9.

1.3 General Research Question¶

The research is guided by four pivotal questions, addressed through regression and classification analyses.

Research questions: Regression Analysis: Is it possible to accurately predict the temperature and its relationship with wind characteristics using historical data? This question is inspired by the current understanding of atmospheric variables and their interactions and specified as follows:

  • Temperature Prediction: Is it possible to build an accurate regression model to predict temperature based on historical data? This involves exploring the relationships between temperature and windspeed over time (daytime, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well the prediction of the temperature for the next x days, weeks, etc. based on the historical weather.

  • Temperature and Wind Modeling: Is it possible to find a correlation or causation between the temperature and windspeed, windgust or winddirection using regression techniques? This involves investigating the impact of independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather. As an outlook to this analysis it could also be analyzed whether certain winddirections result in a certain type of weather event using logarithmic regression or classification.

  • Multivariate Temperature Prediction / Linear Regression of Temperature with Multiple Predictors: How does the incorporation of multiple atmospheric predictors, such as windspeed, windgust, winddirection, air pressure, snow and ice parameters enhance the accuracy of temperature prediction compared to a model solely based on windspeed? This question delves into the complex interplay of various atmospheric variables and their combined effect on temperature changes over different time scales (daytime, day, season, etc.). The outcome of this analysis will contribute to a nuanced model that considers a spectrum of atmospheric conditions for more precise temperature forecasting. This tasks also includes the regression model to uncover interactions and synergies among different atmospheric predictors in influencing temperature variations. This question seeks to identify potential non-linear relationships or combined effects of multiple predictors on temperature. Understanding these interactions is crucial for refining the predictive model and gaining insights into the intricate dynamics of the atmosphere. Furthermore temporal dynamics are to be analyzed. How does the temporal aspect, including diurnal patterns and seasonal variations, impact the relationship between temperature and multiple predictors? This question aims to capture the temporal dynamics of the atmosphere and assess how the predictive power of the model varies across different timeframes. Examining temporal patterns will contribute to a more nuanced understanding of how atmospheric predictors influence temperature fluctuations under varying conditions.

  • Extreme Weather Event Prediction by Temperature in Logistic Regression: Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or alternatively windspeed) ranges? This question involves categorizing temperature into distinct classes and determining whether certain temperature ranges are indicative of extreme weather events. The outcome of this analysis will contribute to weather an observation is an extreme weather event, providing valuable insights into the correlation between temperature and specific meteorological phenomena. To achieve this, this tasks also includes to find out, what the critical temperature thresholds associated with the onset or intensification of different weather events are. This question aims to identify temperature breakpoints that signify transitions between different weather conditions. Understanding these thresholds is essential for developing early warning systems and enhancing preparedness for weather-related events.

Classification Analysis: Can effective classification and prediction of weather events, including extreme occurrences, be achieved based on multivariate weather data? This aspect of the research aims to develop methods for accurate prediction of weather events, acknowledging the complexity and variability of weather patterns and is specified as follows:

  • Extreme Weather Events in Binary Classification: Is it possible to classify and predict extreme weather events such as storms? This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.

  • Weather Event and Pattern Classification in Multiclass Classification: Is it possible to categorize and predict different extreme weather events based on multivariate weather data? This involves using multiclass classification algorithms. The results of this classification analysis is the prediction of certain weather events based on the current weather data and a model that was trained on historical weather data.

1.4 Hypotheses Regarding Question¶

The hypotheses are twofold, corresponding to the projects dual analytical approach:

In Regression: There exists a significant correlation between temperature and wind characteristics, which can be modeled to predict future temperature trends and variations. This hypothesis is based on the premise that atmospheric variables are interconnected and can be analyzed to forecast weather conditions.

In Classification: Specific patterns in the weather data can accurately predict various weather events, including extreme conditions. This hypothesis is informed by the need for effective prediction models in the face of increasingly frequent and severe weather events.


  1. de Lima, Glauston, R.T. / Stephan, S. (2013): A new classification approach for detecting severe weather patterns, in: Computers & geosciences 57 (2013): 158-165.↩

  2. ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩

  3. Liljequist, G.H. / Cehak, K. (1984): Allgemeine Meteorologie. 3. Auflage, Springer-Verlag.↩

  4. Fathi, M. / Haghi Kashani, M. / Jameii, S. M. / Mahdipour, E. (2022): Big Data Analytics in Weather Forecasting: A Systematic Review, in: Archives of Computational Methods in Engineering 29.2 (2022, Springer): 1247–1275↩

  5. Hjelmfelt, M.R. (1990): Numerical study of the influence of environmental conditions on lake-effect snowstorms over Lake Michigan, in: Monthly Weather Review, 118(1), pp.138-150.↩

  6. Ghirardelli, J.E. (2005): An Overview of the Redeveloped Localized Aviation Mos Program (Lamp) For Short-Range Forecasting.↩

  7. The contribution of weather forecast information to agriculture, water, and energy sectors in East and West Africa↩

  8. A Hybrid Dataset of Historical Cool-Season Lake Effects From the Eastern Great Lakes of North America↩

  9. Evaluation of spatial-temporal variation performance of ERA5 precipitation data in China↩

2. Setup¶

2.1 Methodological Insights and Impact¶

This project employs a range of statistical and machine learning techniques to analyze the ERA5 dataset. The outcomes of this research are expected to not only enhance meteorological understanding but also provide practical tools for weather prediction, with implications for environmental policy and public safety. By harnessing the power of the ERA5 dataset 1 and advanced analytical techniques 2, this project aims to contribute valuable insights to the field of meteorology and support informed decision-making in a world increasingly affected by weather-related challenges.

2.2 Methodology Overview¶

This project uses a multifaceted approach, primarily utilizing Design Science Research (DSR) 3 in line with Hesse's framework4. This methodological framework focuses on the creation and critical evaluation of artifacts to address specific problems. In this DSR approach, complex weather phenomena are identified as problems to be addressed. Additionally, iterative prototyping 5 is employed, enabling systematic refinement of models and methods based on continuous evaluation and integration of data-driven insights6. This project combines the DSR cycle by Gregor / Hevner, with iterative prototyping by Wilde / Hess and Goldman / Narayanaswamy. This integration fosters a dynamic environment where each prototype's development and evaluation progressively inform subsequent cycles of design and analysis. This leads into a cycle of artifact creation (in this case, models and algorithms) specifically tailored to analyze weather patterns in Bancroft using the ERA5 dataset. These models will be refined continuously through iterative prototyping, where each iteration's outcomes inform the next cycle, ensuring they are increasingly effective and accurate. The artifacts are then rigorously evaluated against the research questions. The results are evaluated in every iteration using cross-validation by Shao 1993 and Browne 20007. This process is enriched by a comprehensive literature review by Webster / Watson 8, conducted before and during the implementation, ensuring the methods and analyses remain aligned with current meteorological and data science advancements. The amalgamation of DSR, iterative prototyping, cross-validation and literature research forms the foundation of this approach, ensuring a thorough and robust analysis that is well-suited to address the complexities in atmospheric data analysis.


  1. ECMWF (2023b): ERA5: reanalysis datasets for forecasts. URL: https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5↩

  2. Fathi, M. / Haghi Kashani, M. / Jameii, S. M. / Mahdipour, E. (2022): Big Data Analytics in Weather Forecasting: A Systematic Review, in: Archives of Computational Methods in Engineering 29.2 (2022, Springer): 1247–1275↩

  3. Gregor, S. / Hevner, A.R. (2013): Positioning and Presenting Design Science Research for Maximum Impact, in: MIS Quarterly, Jg. 37, Nr. 2, S. 337-355; Hevner, A. / Chatterjee, S. (2010): Design Research in Information Systems, Theory and Practice. Hrsg. von R. Sharda/S. Voß. Bd. 22. Integrated Series in Information Systems. New York, NY, USA: Springer New York, NY.; Hevner, A. / March, S.T. / Park, J. / Ram, S. (2004): Design Science in Information Systems Research, in: MIS Quaterly 28.1, S. 75–105.↩

  4. Design Science in Information Systems Research.↩

  5. Wilde, T. and Hess, T., 2007. Forschungsmethoden der wirtschaftsinformatik. Wirtschaftsinformatik, 4(49), pp.280-287.; Goldman, N. and Narayanaswamy, K., 1992, June. Software evolution through iterative prototyping. In Proceedings of the 14th international conference on Software engineering (pp. 158-172).↩

  6. Reflective physical prototyping through integrated design, test, and analysis↩

  7. Shao, J., 1993. Linear model selection by cross-validation. Journal of the American statistical Association, pp.486-494.; Browne, M.W., 2000. Cross-validation methods. Journal of mathematical psychology, 44(1), pp.108-132.↩

  8. Webster, J. / Watson, R.T. (2002): Analyzing the past to prepare for the future: Writing a literature review, in: MIS quarterly. Jun 1: xiii-xiii.↩

3. Data¶

3.1 Import data¶

There are two code blocks with imports in the notebook. The imports in import1 (the code below) encompass all the imports needed in chapters 3 and 4. At the beginning of chapter 5, there is a second import block (import2), which consolidates the imports specific to chapter 5. If an import is only needed once, it is included at the place where it is used.

In [1]:
#import basic dependencies (import1)
import pandas as pd
import numpy as np

import datetime
import math

import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

from sklearn.preprocessing import StandardScaler

import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")
In [2]:
df_full = pd.read_csv("../data/external/feature_data_substation_bancroft_labelled.csv")
df_full.head()
Out[2]:
Unnamed: 0 substation run_datetime valid_datetime horizon avg_temp avg_windspd avg_windgust avg_pressure_change avg_snow ... MA_avg_winddir_month MA_avg_temp_change_month label0 label1 label2 label3 wep storm_id2 year month
0 0 Bancroft 2015-07-15 00:00:00 2015-07-15 00:00:00 0.0 287.389224 3.386380 11.136197 52.892217 0.0 ... 80.302464 NaN 0 1 NaN 1 Blue sky day NaN 2015 7
1 1 Bancroft 2015-07-15 01:00:00 2015-07-15 01:00:00 0.0 287.378997 3.326687 11.002795 50.256685 0.0 ... 78.584418 -0.010227 0 1 NaN 1 Blue sky day NaN 2015 7
2 2 Bancroft 2015-07-15 02:00:00 2015-07-15 02:00:00 0.0 287.388845 3.243494 10.700595 47.944054 0.0 ... 77.809235 -0.000189 3 1 NaN 1 Blue sky day NaN 2015 7
3 3 Bancroft 2015-07-15 03:00:00 2015-07-15 03:00:00 0.0 287.427324 3.145505 10.323983 45.855264 0.0 ... 77.931830 0.012700 2 1 NaN 1 Blue sky day NaN 2015 7
4 4 Bancroft 2015-07-15 04:00:00 2015-07-15 04:00:00 0.0 287.489158 3.047607 9.921157 44.823453 0.0 ... 79.272034 0.024983 2 1 NaN 1 Blue sky day NaN 2015 7

5 rows × 184 columns

3.2 Data structure¶

3.2.1 Data Description¶

  • Source: ERA5 dataset by ECMWF (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5)
  • Temporal Coverage: Multiple decades (2015-2022) (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
  • Spatial Resolution: Approximately 31 km (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation)
  • Parameters/Variables: Includes but not limited to temperature, precipitation, wind speed, atmospheric pressure, etc. (https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings)
  • Labels: The data has been labelled in weather event profiles by meteorologists and data scientists from IBM and The Weather Company.
  • Observations: The ERA5 weather dataset provides comprehensive atmospheric information on a global scale, with each observation (row) in the dataset representing a set of meteorological parameters at a specific location and time.

The ERA5 dataset is provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a product of atmospheric reanalysis, a process that involves assimilating observational data from various sources into a numerical weather prediction model.

3.2.2 Data Characterisitcs¶

  • Temporal Aspect: The dataset provides temporal information at various intervals (e.g., hourly or monthly), allowing for both short-term and long-term analyses.

  • Spatial Aspect: With a spatial resolution of around 31 km, the dataset provides a comprehensive global coverage, facilitating region-specific studies. In this project a geographical resolution for the region Bancroft in Ontario, Canada is chosen (as this area is of particular interest for IBM).

  • Multivariate Nature: Multiple meteorological parameters are available, enabling a comprehensive analysis of weather patterns.

  • Quality: ERA5 is known for its high quality and precision, making it suitable for detailed analyses and modeling.

3.2.3 Variables / Parameters¶

The ERA5 dataset is comprehensive, providing a wide range of meteorological variables that cover various aspects of the Earth's atmosphere. The observations and general characteristics measured in the ERA5 dataset are crucial for understanding and analyzing weather and climate patterns. Some of the key variables included in the dataset are1:

  1. Temperature: ERA5 provides information on air temperature at different levels of the atmosphere. This includes 2-meter temperature, which is often used as a representative of surface temperature.

  2. Wind: Wind speed, gust and direction are essential meteorological parameters. ERA5 includes information on both zonal (east-west) and meridional (north-south) components of the wind at different altitudes.

  3. Precipitation: Precipitation data includes rainfall and snowfall. This information is crucial for understanding water cycles and is vital for various applications, including hydrology and agriculture.

  4. Pressure: Atmospheric pressure at different levels is provided in the dataset. Changes in atmospheric pressure are associated with weather patterns and can influence local weather conditions.

  5. Humidity: Relative humidity and specific humidity are included, providing insights into the moisture content of the atmosphere. This is important for understanding cloud formation and precipitation processes.

  6. Snow and ice: Snow density as well as cumulative snow and ice are included.

  7. Other Atmospheric Parameters

These variables collectively provide a detailed snapshot of the Earth's atmosphere and are fundamental for conducting both regression and classification analyses. Researchers and meteorologists often leverage these variables to study climate trends, weather patterns, and environmental changes. The high spatial and temporal resolution of the ERA5 dataset enhances its utility for diverse applications, including climate research, environmental monitoring, and weather forecasting.

3.2.4 Population and sample¶

The data population of the ERA5 dataset refers to the entirety of the atmospheric observations encompassed by the ERA5 dataset. It includes all global weather data recorded and simulated by the ERA5 reanalysis project. The population covers a vast range of geographical locations, spanning the entire globe, and spans several decades, as reanalysis data provides long-term climate records 2.

The used sample for this project is a subset of the data population that is extracted and analyzed for the specific study and investigation. The sample is tailored based on temporal, spatial, or variable-specific criteria. In this instance the sample subset corresponds to the region of Bancroft in Ontario, Canada for the timespan of 2015 to 2022. For this region and timespan the data is holistic and and focusing on the meteorological parameters stated under 7. Working with this specific sample allows to address the specific research, conducting exploratory analyses and creating targeted visualizations for only the interesting area while also saving computing resources by not calculating on redundant data or falsify the analysis by using irrelevant data for the research goal.

3.2.5 Overview of data and structure¶

In the following code snippets the data will be transformed to be able to better visualize and analyse it. Some features might be recalculated (e.g., by calculating the average for every week for visualization).

At first the data is loaded from the .csv file and the head of the data is printed to have a first look at the columns (variables) and rows (observations) of the data. The dataset consists of 65345 observations, 184 columns (including the useful and unique predictor variables and reasonable response variable mentioned in chapter "Variables / parameters", identifier variables "Unnamed: 0" (this might previously be used as an index and will be dropped for the further analysis becuase of redundancy), run_datetime and valid_datetime (which contain the datetime for the observation, which is not only an identifier variable but can also be used in the analysis e.g., time series analysis)). In a next step the distribution and statistics of the variables will be taken a closer look at by analyzing their quantiles, minima, maxima and maxima.

3.2.6 Description of how the Data was collected¶

Collection Method 3:

  1. Observational Data Assimilation: The ERA5 dataset is generated through a process called reanalysis. This involves assimilating vast amounts of observational data from satellites, weather stations, and other sources into a numerical weather prediction model. This assimilation process helps create a consistent and comprehensive representation of the atmosphere over time.

  2. Numerical Weather Prediction Model: ECMWF uses advanced numerical weather prediction models to simulate the Earth's atmosphere. These models take into account the laws of physics governing the atmosphere and use initial conditions based on observational data.

  3. Temporal Coverage: The ERA5 dataset spans multiple decades, with ongoing updates. It covers from 1979 to near-real-time, providing a continuous record of atmospheric conditions. In this project the scope of the data is reduced to the dates of 2015 to 2022 for reasons of resources and project limitations.

  4. Spatial Resolution: The dataset has a high spatial resolution, offering detailed information on a global scale, with grid points approximately 31 kilometers apart.

Purpose: The primary purpose of generating the ERA5 dataset is to provide a comprehensive and accurate representation of past weather conditions. It serves various scientific and operational applications, including climate monitoring, research, and supporting weather forecasting.

Quality Assurance: ECMWF is renowned for its commitment to data quality. The assimilation process involves rigorous validation against a wide range of observational data to ensure the accuracy and reliability of the reanalysis output.

Accessibility: The ECMWF makes ERA5 data available to the public and the scientific community, fostering research and applications in meteorology, climate science, and related fields.


  1. ECMWF (2023c): ERA5: data documentation parameterlistings. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings↩

  2. ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩

  3. ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩

In [3]:
print(df_full.info())
print(df_full.dtypes)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65345 entries, 0 to 65344
Columns: 184 entries, Unnamed: 0 to month
dtypes: float64(166), int64(8), object(10)
memory usage: 91.7+ MB
None
Unnamed: 0          int64
substation         object
run_datetime       object
valid_datetime     object
horizon           float64
                   ...   
label3              int64
wep                object
storm_id2          object
year                int64
month               int64
Length: 184, dtype: object

3.3 Data corrections¶

In [4]:
# Change datetime columns to datetime format
df_full['run_datetime'] = pd.to_datetime(df_full['run_datetime'])
df_full['valid_datetime'] = pd.to_datetime(df_full['valid_datetime'])

# Add temperature column for degree in celsius
df_full['avg_temp_celsius'] = df_full['avg_temp'] - 273.15

# Translate wind directions into cardinal directions
def wind_direction_label(degrees):
    if 22.5 < degrees <= 67.5:
        return 'Northeast'
    elif 67.5 < degrees <= 112.5:
        return 'East'
    elif 112.5 < degrees <= 157.5:
        return 'Southeast'
    elif 157.5 < degrees <= 202.5:
        return 'South'
    elif 202.5 < degrees <= 247.5:
        return 'Southwest'
    elif 247.5 < degrees <= 292.5:
        return 'West'
    elif 292.5 < degrees <= 337.5:
        return 'Northwest'
    else:
        return 'North'
    
df_full['wind_direction_label'] = df_full['avg_winddir'].apply(wind_direction_label)

3.4 Variable lists and overview¶

In the pursuit of extracting meaningful insights and constructing predictive models from a given dataset, it is imperative to establish a comprehensive understanding of the variables involved. This section delineates the variables within the dataset, providing a structured overview of each variable's key characteristics through a data dictionary. The data dictionary encapsulates essential information for each variable, facilitating a streamlined comprehension of their names, descriptions, roles, types, and formats.

While not explicitly used as predictors or responses in modeling, the inclusion of ID columns can offer valuable insights for data exploration and understanding. These columns serve as unique identifiers and are crucial for tracking and referencing specific records within the dataset.

The meticulous documentation provided in this variable list serves as a foundation for subsequent analyses, ensuring transparency, reproducibility, and informed decision-making throughout the data exploration and modeling processes.

3.4.1 Data Dictionary¶

In [5]:
data = {
    'Name': [
        'run_datetime',
        'wep',
        'avg_temp',
        'min_wet_bulb_temp',
        'avg_dewpoint',
        'avg_temp_change',
        'avg_windspd',
        'max_windgust',
        'avg_winddir',
        'wind_direction_label',
        'max_cumulative_precip',
        'max_snow_density_6',
        'max_cumulative_snow',
        'max_cumulative_ice',
        'avg_pressure_change'
        ],
    'Description': [
        'Date and time when the weather observations were recorded.',
        'Weather Event Type (WEP) is a categorization of weather conditions based on specific parameters at a given location and time.',
        'The average temperature measured at two meters above ground level, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Minimum wet bulb temperature recorded during the observation period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Average dewpoint temperature observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Average change in temperature during the observation period, obtained by calculating the difference between this observation and the following.',
        'Average wind speed measured during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Maximum wind gust observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Average wind direction (in degree) observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Wind direction (in cardinal direction) observed during the recording period, obtained by recalculating the avg_winddir parameter.',
        'Maximum cumulative precipitation recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Maximum snow density at a depth of 6 inches, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Maximum cumulative snow recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Maximum cumulative ice recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
        'Average change in atmospheric pressure during the observation period, considering all sensors, for the entire duration of one hour in the Bancroft region.'
        ],
    'Role': [
        'ID / predictor', 
        'response', 
        'response / predictor', 
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor',
        'predictor'
        ],
    'Type': [
        'numerical continuous / ID', 
        'categorical nominal', 
        'numerical continuous', 
        'numerical continuous', 
        'numerical continuous', 
        'numerical continuous', 
        'numerical continuous', 
        'numerical continuous', 
        'numerical continuous', 
        'categorical ordinal', 
        'numerical continuous',
        'numerical continuous',
        'numerical continuous',
        'numerical continuous',
        'numerical continuous'
        ],
    'Format': [
        type(df_full['run_datetime'][0]),
        type(df_full['wep'][0]), 
        type(df_full['min_wet_bulb_temp'][0]),
        type(df_full['avg_temp'][0]), 
        type(df_full['avg_dewpoint'][0]), 
        type(df_full['avg_temp_change'][0]), 
        type(df_full['avg_windspd'][0]), 
        type(df_full['max_windgust'][0]),
        type(df_full['avg_winddir'][0]), 
        type(df_full['wind_direction_label'][0]), 
        type(df_full['max_cumulative_precip'][0]), 
        type(df_full['max_snow_density_6'][0]),
        type(df_full['max_cumulative_snow'][0]), 
        type(df_full['max_cumulative_ice'][0]), 
        type(df_full['avg_pressure_change'][0])
]
}

data_dict_df = pd.DataFrame(data)

# Display the data dictionary
data_dict_df
Out[5]:
Name Description Role Type Format
0 run_datetime Date and time when the weather observations we... ID / predictor numerical continuous / ID <class 'pandas._libs.tslibs.timestamps.Timesta...
1 wep Weather Event Type (WEP) is a categorization o... response categorical nominal <class 'str'>
2 avg_temp The average temperature measured at two meters... response / predictor numerical continuous <class 'numpy.float64'>
3 min_wet_bulb_temp Minimum wet bulb temperature recorded during t... predictor numerical continuous <class 'numpy.float64'>
4 avg_dewpoint Average dewpoint temperature observed during t... predictor numerical continuous <class 'numpy.float64'>
5 avg_temp_change Average change in temperature during the obser... predictor numerical continuous <class 'numpy.float64'>
6 avg_windspd Average wind speed measured during the recordi... predictor numerical continuous <class 'numpy.float64'>
7 max_windgust Maximum wind gust observed during the recordin... predictor numerical continuous <class 'numpy.float64'>
8 avg_winddir Average wind direction (in degree) observed du... predictor numerical continuous <class 'numpy.float64'>
9 wind_direction_label Wind direction (in cardinal direction) observe... predictor categorical ordinal <class 'str'>
10 max_cumulative_precip Maximum cumulative precipitation recorded, con... predictor numerical continuous <class 'numpy.float64'>
11 max_snow_density_6 Maximum snow density at a depth of 6 inches, c... predictor numerical continuous <class 'numpy.float64'>
12 max_cumulative_snow Maximum cumulative snow recorded, considering ... predictor numerical continuous <class 'numpy.float64'>
13 max_cumulative_ice Maximum cumulative ice recorded, considering a... predictor numerical continuous <class 'numpy.float64'>
14 avg_pressure_change Average change in atmospheric pressure during ... predictor numerical continuous <class 'numpy.float64'>

3.4.2 Response Variable: Temperature¶

The variable temperature is a numerical continous variable, which is measured in units of Kelvin.

Temperature is a measure of the thermal energy in a substance or system. It is a fundamental physical quantity. Temperature is a measure of the average kinetic energy of the particles in a substance or system. As the temperature of a substance increases, the particles in the substance gain kinetic energy and move faster. In the dataframe there are different kinds of temperature given.

3.4.3 Response Variable: Weather Event¶

The weather event is a categorical nominal variable with the following values:

Blue Sky Day: A day with clear and cloudless skies. There are no significant weather phenomena such as rain, snow, or storms.1

Mild Snowfall: A light snowfall, where small amounts of snow fall, but not enough to be considered moderate or heavy snowfall.

Moderate Snowfall: A moderate snowfall, where larger amounts of snow fall compared to mild snowfall but not as much as heavy snowfall.

Moderate Rain: A moderate rainfall, where a moderate amount of precipitation occurs in the form of rain.

Heavy Snowfall with Accumulated Snow: Intense snowfall accompanied by a significant accumulation of snow on the ground.

Continuous Freezing Rain: Continuous freezing rain, where rain falls on frozen surfaces and immediately turns into ice.

Storm with Freezing Rain / Heavy Snow- and Icestorm: A storm with freezing rain or heavy snowfall, accompanied by the formation of ice. This can lead to hazardous conditions.

Snowstorm with High Precipitation: A snowstorm with intense snowfall and a high amount of precipitation in the form of snow.

However, when conducting weather analyses and predictions, primary focus lies on the examination and anticipation of extreme weather conditions. Therefore it is not neccessary to differentiate between degrees of cloudyness.


  1. It is important to note that the designation "blue sky" encapsulates a range of commonplace weather situations. Consequently, even on a day characterized as a "blue sky" day there may be a degree of cloudiness.↩

3.5 Data splitting¶

This chapter delves into the crucial process of data splitting in machine learning, focusing on strategies for effectively partitioning datasets into training, testing, and validation sets. A hands-on implementation using the scikit-learn library is presented, highlighting the significance of proper data splitting for robust model evaluation. The chapter concludes with discussions on potential challenges and best practices 1.

Data splitting is a fundamental step in machine learning model development, playing a pivotal role in assessing model performance. The primary goal is to strike a balance between training a model that captures underlying patterns in the data and evaluating its generalization on unseen data. The need for data splitting arises from the desire to create models that generalize well to new, unseen instances. Overfitting, where a model memorizes training data but fails on new data, is a common concern. Conversely, underfitting occurs when a model is too simplistic to capture underlying patterns 2.

Before splitting the data intro training, test and validation data, the dataframe is reduced to only the needed columns to save resources.

3.5.1 Strategies for Data Splitting¶

Random Splitting: Randomly partitioning the data into training and testing sets is a straightforward strategy. This method is suitable for well-shuffled datasets, ensuring a representative sample for both training and testing.

Stratified Splitting: In classification tasks, maintaining the original class distribution is crucial. Stratified splitting ensures that the proportion of each class remains consistent in both training and testing sets, preventing bias.

Temporal Splitting: For time-series data, temporal splitting is essential. It involves training on earlier time periods and testing on later ones, mimicking real-world scenarios where models encounter new data sequentially.

Implementation with scikit-learn The scikit-learn library provides a user-friendly interface for data splitting. The train_test_split function allows easy division of datasets into training and testing sets, with an optional validation set.

3.5.2 Challenges and Best Practices¶

Overfitting and Underfitting: Care must be taken to strike a balance between overfitting and underfitting during model training. Regularization techniques and proper hyperparameter tuning can mitigate these issues.

Hyperparameter Tuning: Hyperparameter tuning is an ongoing process, and the performance of models may be sensitive to changes in hyperparameters. Robust evaluation using validation sets aids in identifying optimal hyperparameter configurations.

Conclusion Proper data splitting is foundational for developing reliable machine learning models. The presented strategies, coupled with the scikit-learn implementation, empower practitioners to create models that generalize well to new data. The chapter underscores the importance of thoughtful data splitting in the pursuit of effective machine learning models.


  1. Scikit-learn (2023a): https://scikit-learn.org/stable/documentation.html↩

  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.↩

In [6]:
#Creating a new dataframe with only the used columns (saves resources)
columns_to_keep = [
    'run_datetime',
    'wep',
    'avg_temp',
    'avg_temp_celsius',
    'min_wet_bulb_temp',
    'avg_dewpoint',
    'avg_temp_change',
    'avg_windspd',
    'max_windgust',
    'avg_winddir',
    'avg_winddir_sin',
    'avg_winddir_cos',
    'wind_direction_label',
    'max_cumulative_precip',
    'max_snow_density_6',
    'max_cumulative_snow',
    'max_cumulative_ice',
    'avg_pressure_change',
    'label0', #additional wep labels for various blue sky events
    'label1', #additional wep labels to differentiate between blue sky and extreme weather events
    'label2'
]

# Use loc to select only the specified columns
df = df_full.loc[:, columns_to_keep]
df
Out[6]:
run_datetime wep avg_temp avg_temp_celsius min_wet_bulb_temp avg_dewpoint avg_temp_change avg_windspd max_windgust avg_winddir ... avg_winddir_cos wind_direction_label max_cumulative_precip max_snow_density_6 max_cumulative_snow max_cumulative_ice avg_pressure_change label0 label1 label2
0 2015-07-15 00:00:00 Blue sky day 287.389224 14.239224 280.809506 280.735246 NaN 3.386380 14.899891 80.302464 ... 0.190676 East 2.009 0.0 0.000 0.0 52.892217 0 1 NaN
1 2015-07-15 01:00:00 Blue sky day 287.378997 14.228997 280.809506 280.414058 -0.010227 3.326687 14.899891 76.866373 ... 0.102466 East 1.209 0.0 0.000 0.0 50.256685 0 1 NaN
2 2015-07-15 02:00:00 Blue sky day 287.388845 14.238845 280.809506 280.187074 0.009848 3.243494 14.899891 76.258867 ... 0.651950 East 0.400 0.0 0.000 0.0 47.944054 3 1 NaN
3 2015-07-15 03:00:00 Blue sky day 287.427324 14.277324 280.809506 280.049330 0.038479 3.145505 14.899891 78.299616 ... -0.971290 East 0.000 0.0 0.000 0.0 45.855264 2 1 NaN
4 2015-07-15 04:00:00 Blue sky day 287.489158 14.339158 280.809506 279.980697 0.061834 3.047607 14.702229 84.632852 ... -0.981976 East 0.000 0.0 0.000 0.0 44.823453 2 1 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65340 2022-12-27 04:00:00 Moderate rain 264.241641 -8.908359 260.284794 262.061976 -0.124561 1.962197 8.444256 232.606824 ... 0.991695 Southwest 2.126 0.0 25.643 0.0 NaN 5 0 3.0
65341 2022-12-27 05:00:00 Blue sky day 264.115391 -9.034609 260.284794 262.114357 -0.126250 1.978823 7.475906 229.938704 ... -0.823955 Southwest 2.226 0.0 21.161 0.0 NaN 5 1 NaN
65342 2022-12-27 06:00:00 Blue sky day 264.024853 -9.125147 260.284794 262.206179 -0.090537 2.005855 7.305549 227.024163 ... 0.675251 Southwest 2.426 0.0 16.430 0.0 NaN 5 1 NaN
65343 2022-12-27 07:00:00 Blue sky day 264.048368 -9.101632 260.284794 262.350025 0.023514 2.040978 7.305549 223.900355 ... -0.662027 Southwest 2.826 0.0 10.859 0.0 NaN 5 1 NaN
65344 2022-12-27 08:00:00 Blue sky day 263.918722 -9.231278 260.284794 262.512490 -0.129646 2.078741 6.818578 220.894487 ... 0.554528 Southwest 3.426 0.0 5.640 0.0 NaN 0 1 NaN

65345 rows × 21 columns

In [7]:
from sklearn.model_selection import StratifiedShuffleSplit

# Initialize StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Split the data using StratifiedShuffleSplit
for train_index, test_temp_index in sss.split(df, df['wep']):
    train_data = df.iloc[train_index]
    temp_data = df.iloc[test_temp_index]

# Further split temp_data into test and validation sets
for test_index, validation_index in sss.split(temp_data, temp_data['wep']):
    test_data = temp_data.iloc[test_index]
    validation_data = temp_data.iloc[validation_index]

# Display the shapes of the resulting sets
print("Train Data Shape:", train_data.shape)
print("Test Data Shape:", test_data.shape)
print("Validation Data Shape:", validation_data.shape)

# Display the distribution of labels 'wep'
print("\nDistribution of 'wep' in Train Data:")
print(train_data['wep'].value_counts())

print("\nDistribution of 'wep' in Test Data:")
print(test_data['wep'].value_counts())

print("\nDistribution of 'wep' in Validation Data:")
print(validation_data['wep'].value_counts())
Train Data Shape: (52276, 21)
Test Data Shape: (10455, 21)
Validation Data Shape: (2614, 21)

Distribution of 'wep' in Train Data:
wep
Blue sky day                                           42106
Mild snowfall                                           3598
Moderate snowfall                                       2336
Moderate rain                                           2104
Heavy snowfall with accumulated snow                     832
Frondurchlauf / Continuous freezing rain                 747
Storm with freezing rain / Heavy snow- and icestorm      344
Snowstorm with high precipitation                        209
Name: count, dtype: int64

Distribution of 'wep' in Test Data:
wep
Blue sky day                                           8421
Mild snowfall                                           719
Moderate snowfall                                       467
Moderate rain                                           421
Heavy snowfall with accumulated snow                    166
Frondurchlauf / Continuous freezing rain                150
Storm with freezing rain / Heavy snow- and icestorm      69
Snowstorm with high precipitation                        42
Name: count, dtype: int64

Distribution of 'wep' in Validation Data:
wep
Blue sky day                                           2106
Mild snowfall                                           180
Moderate snowfall                                       117
Moderate rain                                           105
Heavy snowfall with accumulated snow                     42
Frondurchlauf / Continuous freezing rain                 37
Storm with freezing rain / Heavy snow- and icestorm      17
Snowstorm with high precipitation                        10
Name: count, dtype: int64

4 Analysis¶

4.1 Descriptive statistics¶

In [8]:
df.describe().T
Out[8]:
count mean min 25% 50% 75% max std
run_datetime 65345 2019-04-06 14:09:11.362766848 2015-07-15 00:00:00 2017-05-25 23:00:00 2019-04-07 01:00:00 2021-02-14 16:00:00 2022-12-27 08:00:00 NaN
avg_temp 65345.0 279.574328 243.849393 271.114219 279.882735 289.903226 300.934144 11.383325
avg_temp_celsius 65345.0 6.424328 -29.300607 -2.035781 6.732735 16.753226 27.784144 11.383325
min_wet_bulb_temp 65345.0 273.601253 239.056465 265.232651 274.245761 284.128378 294.642287 11.914203
avg_dewpoint 65345.0 274.972135 238.429236 267.161062 275.429991 284.366793 294.97882 10.93508
avg_temp_change 65344.0 -0.000361 -1.168328 -0.095711 0.012257 0.10334 1.104699 0.191027
avg_windspd 65345.0 2.440013 0.643393 1.882336 2.34648 2.893087 6.470306 0.77288
max_windgust 65345.0 12.659095 3.918856 9.966063 12.266102 14.895518 28.236084 3.721124
avg_winddir 65345.0 210.247462 21.17728 165.321173 222.129253 262.922675 346.805959 70.896648
avg_winddir_sin 65345.0 -0.008188 -1.0 -0.715716 -0.012884 0.699482 1.0 0.707159
avg_winddir_cos 65345.0 0.003304 -1.0 -0.704408 0.008696 0.708963 1.0 0.70701
max_cumulative_precip 65345.0 4.796664 0.0 0.1 1.528 6.2 83.471 7.64858
max_snow_density_6 65345.0 21.123859 0.0 0.0 0.0 0.0 722.582771 67.439046
max_cumulative_snow 65345.0 17.454941 0.0 0.0 0.0 8.257 920.464 53.768892
max_cumulative_ice 65345.0 0.056319 0.0 0.0 0.0 0.0 9.955316 0.405855
avg_pressure_change 56672.0 0.017391 -168.315674 -19.263538 -0.071299 19.549097 153.811455 33.88573
label0 65345.0 2.027072 0.0 0.0 2.0 3.0 7.0 1.933712
label1 65345.0 0.805463 0.0 1.0 1.0 1.0 1.0 0.395847
label2 12712.0 3.06191 0.0 1.0 3.0 5.0 6.0 2.126446

After getting a closer overlook over the data some basic explorative data analysis will be conducted to visualize the mean, quantiles, shape, distribution, variance, etc. of the data and analyze the correlation of the variables.

At first, the most relevant predictor and response variables are calculated (already done in previous work of IBM with the dataset) in order to visualize their development over time in a time series analysis. The temperature measurements average temperature (avg_temp), minimal wet bulb temperature (min_wet_bulb_temp) and average temperature change (avg_temp_change) are shown in Kelvin. Measurements of wind are given as the average windspeed (avg_windspd), the maximal windgust (max_windgust) and the average winddirection (avg_winddir). Weather events are measured as maximal cumulaive precipation (max_cum_precip), maximal snow density (max_snow_density_6), maximal cumulative snow (mamax_cumulative_snow), maximal cumulative ice (max_cumulative_ice) and the average pressure change (avg_pressure_change).

4.2 Exploratory data analysis¶

In [9]:
import matplotlib.dates as mdates

rcParams['figure.figsize'] = 20, 15
sns.set(style="darkgrid")

parameters = [
        ('avg_temp', 'Kelvin'),
        ('min_wet_bulb_temp', 'Kelvin'),
        ('avg_dewpoint', 'Kelvin'),
        ('avg_temp_change', 'Kelvin'),
        ('avg_windspd', 'm/s'),
        ('max_windgust', 'm/s'),
        ('avg_winddir', 'degrees'),
        ('max_cumulative_precip', 'mm'),
        ('max_snow_density_6', 'kg/m^3'),
        ('max_cumulative_snow', 'mm'),
        ('max_cumulative_ice', 'mm'),
        ('avg_pressure_change', 'Pa')
    ]

def calculate_moving_averages(df, variables, time_periods): # has already been done in previous work at IBM (by Julian Erath) for the dataset # selected_variables = ['max_windgust', 'min_wet_bulb_temp', 'max_snow_density_6', 'max_cumulative_precip', 'max_cumulative_ice', 'max_cumulative_snow', 'avg_windspd','avg_pressure_change','avg_temp','avg_dewpoint','avg_winddir','avg_temp_change'] # selected_time_periods_week = [168] # selected_time_periods_month = [720]
    for variable in variables:
        for period in time_periods:
            ma_column_name = f'MA_{variable}_{period}_hours'
            df[ma_column_name] = df[variable].rolling(period, min_periods=1).mean()



def plot_weather_parameters(ax, data, parameter, unit):
    ax.plot(data.run_datetime, data[parameter], label=f'{parameter} Actual')
    ax.plot(data.run_datetime, data[f"MA_{parameter}_week"], label=f'{parameter} Weekly MA')
    ax.plot(data.run_datetime, data[f"MA_{parameter}_month"], label=f'{parameter} Monthly MA')
    ax.set_title(f'Bancroft {parameter}')
    ax.set_ylabel(unit)

    # Set the x-axis major locator to YearLocator and the formatter to the year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    # Manually add the mid-year tick for July 1st
    mid_years = [mdates.date2num(datetime.datetime(year, 7, 1)) for year in range(data.run_datetime.dt.year.min(), data.run_datetime.dt.year.max() + 1)]
    ax.set_xticks(list(ax.get_xticks()) + mid_years, minor=True)
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('/%m'))

    # Rotate the labels for better visibility
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=90, ha='right')

def create_subplots(data):
    fig, axes = plt.subplots(4, 3, figsize=[20,15])

    for i, (parameter, unit) in enumerate(parameters):
        plot_weather_parameters(axes[i // 3, i % 3], data, parameter, unit)

    plt.suptitle('Bancroft weather parameters time series', fontsize=24)
    plt.tight_layout()
    # Update the legend to handle multiple lines
    handles, labels = axes[0, 0].get_legend_handles_labels()
    new_labels = [label.replace("avg_temp", "Average Temp.").replace("MA", "Moving Average") for label in labels]
    fig.legend(handles, new_labels, loc='upper right', ncol=3)

    plt.show()
    plt.savefig('../output/Bancroft weather parameters time series.png')

create_subplots(df_full)
<Figure size 2000x1500 with 0 Axes>

4.2.1. Time series analysis of weather data for the region Bancroft¶

In these plots, the blue line represents the actual data points, capturing the daily variability. The orange line indicates the weekly moving average, smoothing short-term fluctuations and making it easier to observe the weekly trends. The green line shows the monthly moving average, providing a clear view of the longer-term trends and seasonal patterns. The differences in color help distinguish the immediate data from its broader trends and cycles. Weekly Moving Averages (orange line), smooth out short-term fluctuations and make weekly trends easier to see. They still follow the daily data points (blue line) closely, but with fewer peaks, which is a reduction in daily variability. Monthly Moving Averages (green line) further smooth the curve and highlight longer-term trends and seasonal patterns. These are useful for understanding the general direction of temperature data over the year. They fluctuate less than the weekly MAs and can better show the summer peaks and winter troughs of the years.

average temperature, minimal wet bulb temperature, average dewpoint, average temperature change The first four diagrams illustrate temperature-related parameters for Bancroft: average temperature (avg_temp), minimal wet bulb temperature (min_wet_bulb_temp), average dewpoint (avg_dewpoint) and average temperature change (avg_temp_change), all measured in Kelvin. Each plot reveals a clear seasonal pattern, with one oscillation corresponding to one year, indicating a repetitive cycle of temperature variation across seasons. The first three plots show a sinusoidal-like pattern that repeats annually, with peaks representing the warmer summer months and troughs corresponding to the colder winter months. It seems that the coldest months are those, around New Year, while the warmest months are July and August. This suggests a strong seasonal influence on temperatures, as expected in most climatic regions. The weekly and monthly MAs in all three plots show similar seasonal patterns with highest values in summer and lowest values in winter, indicating a strong seasonal dependence of temperatures as well. They seem similar in their seasonal patterns as well, with their peaks and troughs aligning closely. This indicates a strong correlation between them. The consistency across the plots suggests that these temperature parameters are influenced by the same seasonal variables and possibly by each other.

The average temperature change (avg_temp_change) represents the hourly direction and Kelvin of temperature change. This plot shows a similar amount of repetitive cycles as well, indicating a correlation to the other temperature variables. The average temperature change plot shows, despite the daily fluctuations, that the general pattern of temperature changes remains stable over the year and is in line with the seasonal fluctuations of the other temperature parameters. It is interesting that the values are highest in the cold winter months and fall in the warm months. This suggests an inverse correlation with average temperature, minimal wet bulb temperature and average dewpoint.

Average Wind Speed and Maximum Wind Gusts The average wind speed (avg_windspd) and maximum wind gust (max_windgust) are quantified in meters per second. Both plots show substantial day-to-day variability with many peaks, which may correspond to transient weather events such as storms or frontal systems. The monthly moving averages in both wind speed plots do not show a clear seasonal trend, suggesting that wind speeds in Bancroft do not follow a simple seasonal pattern. Instead, they fluctuate throughout the year, possibly due to the region's complex weather dynamics. It is not possible to determine a trend or peaks and troughs across the years. However, a strong correlation between the average wind speeds and maximum wind gusts is seen. Days with higher average wind speeds tend to also experience more intense wind gusts, as shown by the concurrent peaks across both plots. The average wind speed plot seems volatile, the maximum wind gust plot reveals even a higher degree of volatility across the lines, with some gusts reaching well above the average range. This highlights the possible unpredictable nature of wind attributes, which can pose challenges for weather prediction and risk management.

Average Wind Direction (avg_winddir) Analysis Complementing the analysis of average wind speed and maximum wind gusts, the average wind direction offers insights into the directional aspect of wind patterns. The average wind direction is measured in degrees, indicating the direction from which the wind originates. The plot shows high day-to-day variability. There is no clear seasonal pattern in wind direction observable as well, suggesting that, like wind speeds, the direction for the region Bancroft is influenced by complex weather dynamics rather than a simple seasonal pattern. This might be attributed to the region's geography or the interaction between local and larger-scale climate patterns.

When looking at the diagrams for the average wind direction, the average wind speed and the maximum wind gusts, the high volatility of all three variables is striking. The average wind direction as a vector variable shows an even greater range of fluctuation, as each value represents a different actual direction and therefore no clear "average value" can be read off. This inherent volatility makes it difficult to visually determine whether there is a direct correlation between the variables. The high fluctuation in the data makes it difficult to identify high and low points. Although it seems possible that certain wind directions could correspond to stronger or weaker wind speeds and gusts - due to geographical or meteorological factors - a visual analysis alone is not sufficient to confirm a correlation. In order to be able to make a reliable statement about the existence of a correlation, further statistical tests are required. These could include cross-correlation analyses, which show how changes in one variable are related to changes in another over time.

Bancroft Maximum Cumulative Snow, Snow Density and Ice Formation Observing the maximum cumulative snow (in millimeters), snow density (in kilograms per cubic meter) and ice formation plot (in millimeters), it is noticeable that these plots display a pronounced correlation with seasonal cycles with peaks typically in the winter months. The blue line captures individual snow events, with the single-day variations giving way to a smoothed seasonal trend depicted by the weekly (orange) and monthly (green) moving averages. Notably, these plots show an inverse correlation to the temperature related graphs. The denser snow readings coincide with higher snowfall accumulations, suggesting that subsequent snowfalls contribute to the compaction and increase in snow density. The moving averages, particularly the green lines, underscore the annual rhythm of snowfall and its gradual compaction into denser snowpacks.

Bancroft Average Pressure Change The Average Pressure Change plot, illustrating the average pressure change (in Pascals), exhibits seasonal variation. Compared to the other variables, there is a clear correlation between temperature and snow variables. This plot looks almost identical to the "average temperature change" plot, which is a indicator for a high correlation. This means, that a high pressure change could lead to a high temperature change. The moving averages show a smoothing effect and highlight the strong seasonal pattern. This suggests that pressure changes in Bancroft are likely driven by seasonal influences that dictate like temperature, snow, and ice. Other variables like patternstransient and more complex weather systems could have a great influence as well. It is also apparent, that the data for average pressure change is missing for the year 2022. For some analysis including this parameter, the data for the year 2022 needs to ne dropped.

Insights: Understanding the Meteorological Dynamics of Bancroft

  • Seasonal Temperature Patterns: A distinct annual cyclicity in temperature parameters is evident, with higher values in the summer and lower in the winter. The moving averages, both weekly and monthly, effectively smooth out daily variances, reinforcing the observed seasonal trends.

  • Wind Characteristics: Wind speed and gusts display high day-to-day variability, lacking clear seasonal trends. This suggests that Bancroft's wind patterns are subject to complex and multifaceted weather dynamics rather than straightforward seasonal influences.

  • Variability in Wind Direction: Similar to wind speed, wind direction shows high variability with no discernible seasonal trend, indicating that local geographical and meteorological complexities significantly influence wind patterns.

  • Snowfall Patterns and Density: The snowfall data show a pronounced seasonal pattern, inversely correlated with temperature data. Peaks during the winter months align with the temperature lows, and the density of snow appears to increase with subsequent snowfalls, indicating compaction over time.

  • Pressure Changes: The average pressure change closely mirrors the temperature change plot, suggesting a strong relationship between atmospheric pressure and temperature, with seasonal factors being a significant driver of these variations.

It is important to keep in mind that the visualizations may not be the optimal method for identifying all types of correlations or patterns within the data. The visual analysis alone might not suffice to confirm the suspected correlation. Wind direction and wind speed is a good example due to the inherent volatility and the vector nature of wind direction data. Advanced statistical tests, like cross-correlation analysis, could be necessary to ascertain the relationships between variables more accurately.

4.2.2 Distribution, Frequency and Details for Weather Parameter Observations¶

Moving from the broad overview provided by the pie charts, the specifics are now zoomed in on with histograms. These will shed light on the distribution and frequency of weather parameters, offering a deeper understanding that's crucial for forecasting extreme weather conditions. In the next visualization, the corresponding historams are shown, which provide a quantitative view of the distribution of weather data.

In [10]:
def plot_histogram(data, parameter, unit, ax):
    # Plot the histogram
    sns.histplot(data[parameter], bins=30, kde=True, ax=ax, color='blue')
    # Calculate the mean of the parameter
    median_value = data[parameter].median()
    # Draw a vertical line for the median value
    ax.axvline(median_value, color='red', linestyle='dashed', linewidth=2)
    # Add a label for the line
    ax.text(median_value, ax.get_ylim()[1]*0.9, f'Median: {median_value:.2f}', color='red')

    # Set title and axis labels
    ax.set_title(f'Histogram for {parameter}')
    ax.set_ylabel('Frequency')
    ax.set_xlabel(unit)

# Set up the subplots with shared y-axis
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=[20,15])

# Plot histograms for all parameters
for i, (parameter, unit) in enumerate(parameters):
    row, col = divmod(i, 3)
    plot_histogram(df, parameter, unit, axes[row, col])

# Manually set bottom margin to make x-axis visible
plt.subplots_adjust(bottom=0.1)  # Der Wert sollte positiv sein, um die Achsenbeschriftungen sichtbar zu machen.

# Show the plot
plt.suptitle('Bancroft weather parameters histograms', fontsize=24)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Angepasst, um den Titel und die Plots richtig darzustellen
plt.show()
plt.savefig('../output/Bancroft weather parameters histograms.png')
<Figure size 2000x1500 with 0 Axes>

Average Temperature (avg_temp): Histogram: Shows a left-skewed, bimodal distribution. This type of distribution suggests two predominant temperature states, often indicative of seasonal variations as seen with changing weather conditions over the year.

Peak: The plot shows a bimodal distribution, with a distinct peak near 295 Kelvin, indicating a frequently occurring temperature in the dataset, possibly representing a typical average temperature in summer. Another peak is around 275 Kelvin, possibly representing a typical average temperature in winter.

Median: The median value of 279.88 Kelvin is an indicator of the central tendency of the temperature distribution and is lower than the peak, reflecting the left-skewed distribution.

Skewness: The slight tilt to the right despite the left-skewed distribution means that while there are more low temperatures, the data are not heavily skewed towards the lower temperatures.

Dispersion: The seemingly moderate dispersion around the two peaks suggests that the temperatures in these areas are relatively consistent. The bimodal nature of the distribution shows that there are two distinct areas where temperatures most frequently occur. Observations deviating strongly from those peaks might indicate observations with extreme weather events.

Comparison to Time Series: The time series comparison confirms the seasonal patterns observed in the histogram. The data show a recurring tendency towards the same temperatures at certain times of the year, typical for climatic changes through the seasons.

Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dew Point (avg_dewpoint): Histogram: A left-skewed bimodal distribution, similar to that of the average temperature. The bimodality might indicate two dominant states likely reflecting seasonal fluctuations.

Peaks: The peaks lie between 285 and 290 Kelvin, indicating a common range for both variables, possibly an indicator of certain climatic conditions or seasons.

Median: The median values are approximately 274.25 Kelvin for the minimum wet bulb temperature and about 275.43 Kelvin for the average dew point. These values indicate a tendency towards cooler conditions in the data collection.

Dispersion: The moderate dispersions around the peaks suggest some consistency in climatic conditions. There's a clear seasonal dependence, as seen by the regular oscillation of values in the time series.

Comparison to Time Series: The time series plots confirm the bimodal peaks and the seasonal nature of the histograms. There is a clear seasonal oscillation, with peaks in summer and lower values in winter.

Average Temperature Change (avg_temp_change): Histogram: The distribution is symmetrically centered around the zero point, suggesting a bell shape. This implies no significant skewness and that daily temperature changes might be normally distributed.

Peak: The peak at 0 Kelvin suggests that no change in temperature is the most common state. This could point to a stable climate where extreme temperature fluctuations from day to day are not common.

Median: The median at 0.01 Kelvin is nearly zero, confirming the peak. The median reflects the central tendency in a symmetric distribution, here highlighting the stability of temperature changes.

Comparison to Time Series: The time series analysis shows that large daily temperature fluctuations are rare, confirmed by the concentration of data points around the null line in the histogram. The time series plots also show that most temperature changes are minor and oscillate around this null point.

Dispersion: Due to the symmetry of the distribution and the concentration of data near zero, the dispersion appears to be low. This means there are no large unexpected temperature changes, and the temperature remains quite stable on average.

Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): Histograms: Both histograms exhibit a right-skewed, unimodal distribution. This shape of distribution indicates that lower speed values occur more frequently than high ones.

Peak: The most common value (mode) for average wind speed is around 2 m/s, while that for maximum wind gusts is about 12.5 m/s. These values represent the most frequently occurring conditions for each of the two measurements.

Median: The median for average wind speed is 2.35 m/s and for maximum wind gusts 12.27 m/s. The median is higher than the mode, typical for a right-skewed distribution and indicating that the distribution of data has a larger number of lower speeds.

Skewness: The right skew of both distributions confirms that lower wind speeds and gusts are more common, and higher speeds occur less frequently.

Comparison to Time Series: The time series data show that high wind speeds and gusts occur sporadically and are not a constant state. There are peaks representing high wind events corresponding with the rarer higher values in the right-skewed histogram.

Dispersion: There appears to be a broad dispersion for both parameters, particularly for maximum wind gusts, which occasionally reach peak values. This shows that while low wind speeds are the norm, there is still a considerable number of occasions where the wind becomes strong enough to be classified as a gust.

Maximum Cumulative Precipitation (max_cumulative_precip): Histogram: The histogram shows a strongly right-skewed, unimodal distribution with a pronounced long tail extending to higher values. This means that high precipitation amounts do occur but are much less common than low amounts.

Peak: The mode is near zero, indicating that most observed periods have either no or very little precipitation.

Median: With a median of 1.53 mm, it is indicated that more than half of the precipitation events are below this value. The median is lower than the average due to the right skew.

Skewness: The strong right skew of the histogram confirms that low precipitation amounts are frequent and high amounts rare.

Comparison to Time Series: The time series shows that while high precipitation events occur irregularly, when they do occur, they can be substantial. This corresponds with the long tail of the histogram, indicating rare but extreme precipitation events.

Dispersion: The wide dispersion in the histogram reflects the variability in precipitation events, from very dry periods to rare heavy rainfalls.

Maximum Cumulative Snow (max_cumulative_snow) and Maximum Cumulative Ice (max_cumulative_ice): Histograms: The histograms for both variables show an extremely right-skewed, unimodal distribution, indicating a concentration of values near zero and that high values are very rare.

Peaks: Both histograms have their peaks near zero, indicating that most periods have either no snow or ice, or only very small amounts.

Median: A median of 0 mm for both variables indicates that the majority of observed events accumulate no snow or ice, highlighting the rarity of events with noticeable accumulation.

Skewness: The strong right skew in both histograms shows that while most time periods have no or minimal accumulation, the rare events with snow or ice formation can lead to extremely high values.

Comparison to Time Series: The time series plots confirm that high accumulations of snow and ice are restricted to a few events, typically occurring in the winter months. The moving averages (MAs) from the time series indicate that these high values do not persist and are generally short-lived events.

Maximum Snow Density (max_snow_density_6): Histogram: The histogram shows that nearly all measured values are near zero, indicating that high snow densities are very rare. The few outliers confirm the presence of occasional significant snow density events.

Median: A median of 0 kg/m^3 indicates that the vast majority of observations have no or very low snow density, underscoring the rarity of events with higher snow density.

Skewness: An extremely right-skewed distribution indicates that while most time periods have no or low snow density, the rare events with high snow density can lead to extremely high values.

Dispersion: The dispersion around the zero point appears to be low, indicating that extreme pressure changes are unusual and that atmospheric pressure in this region remains relatively stable.

Comparison to Time Series: The time series plots show little variation in air pressure over time, which is confirmed by the histogram. Most changes are small and there are no long-term trends of increasing or decreasing pressure.

The climate in Bancroft is characterized by a bimodal and seasonal temperature distribution, which indicates significant weather changes throughout the year. The temperature shows no extreme daily fluctuations. Unimodal, low wind speeds are more common than high wind gusts. Precipitation events are mostly low, but there are occasional intense rainfalls. Snow and ice accumulations are rare, but occur sporadically in significant quantities. Atmospheric pressure remains relatively stable, with little long-term fluctuation. Overall, the analysis indicates a climate that is dominated by seasonal patterns and is generally moderate and stable, with a tendency to extreme weather, which includes snow, ice, cold and high winds, which was to be assumed for Canadian climate.

The frequencies of variable values are presented in the following scatterplot. This allows for a quick identification of which variables occur particularly rarely or frequently in a given expression.

In [11]:
def plot_dotplot(data, parameter, unit, ax):
    sns.stripplot(x=data[parameter], ax=ax, color='blue', size=8, jitter=True)
    ax.set_title(f'Dot Plot for {parameter}')
    ax.set_ylabel('Frequency')
    ax.set_xlabel(unit)

# Set up the subplots with shared y-axis
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=[20,15])

# Plot dot plots for all parameters
for i, (parameter, unit) in enumerate(parameters):
    row, col = divmod(i, 3)
    plot_dotplot(df, parameter, unit, axes[row, col])

# Manually set bottom margin to make x-axis visible
plt.subplots_adjust(bottom=-0.1)

# Show the plot
plt.suptitle('Bancroft weather parameters dot plots', fontsize=24)
plt.tight_layout()
plt.show()
plt.savefig('../output/Bancroft weather parameters dot plots.png')
<Figure size 2000x1500 with 0 Axes>

The dot plots for the weather parameters of Bancroft provide a detailed view of data frequency across various metrics. These diagrams are useful for quickly identifying rarer and more frequent events and offer an additional dimension to the understanding of histogram and time series analyses.

Average Temperature (avg_temp): The points are widely scattered across a range of values, indicating a variety of temperatures throughout the year. This corresponds with the bimodal patterns in the histogram and seasonality in the time series.

Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dew Point (avg_dewpoint): The point distributions are similar to average temperature and confirm the bimodal and seasonal fluctuations.

Average Temperature Change (avg_temp_change): The even distribution of points around the zero point indicates that significant daily temperature changes are rare, consistent with the bell-shaped histograms.

Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): The points concentrate at lower speeds, reflecting the right skew observed in the histogram and the rare peaks in the time series for higher speeds.

Maximum Cumulative Precipitation (max_cumulative_precip): The concentration of points near zero confirms the right skew observed in the histogram and the sporadic peaks in the time series plots for high precipitation events.

Maximum Cumulative Snow (max_cumulative_snow) and Ice Accumulation (max_cumulative_ice): The points are predominantly concentrated at the lower end of the scale, highlighting the rarity of events with higher accumulations, corresponding with the extremely right-skewed histograms.

Maximum Snow Density (max_snow_density_6): The distribution of points near zero suggests that high snow densities are rare, as indicated in the histogram and time series data.

Average Pressure Change (avg_pressure_change): The points are evenly distributed around the zero value, indicating a generally stable pressure environment, which is also confirmed by the histogram and time series plots.

In [12]:
# Set up the subplots with shared x-axis
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 15), constrained_layout=True)

# List of relevant weather parameters
relevant_parameters = ['avg_temp', 'min_wet_bulb_temp', 'avg_dewpoint', 'avg_temp_change', 'avg_windspd', 'max_windgust',
                       'avg_winddir', 'max_cumulative_precip', 'max_snow_density_6', 'max_cumulative_snow',
                       'max_cumulative_ice', 'avg_pressure_change']

sns.set(style="darkgrid")

# Loop through each parameter and create a boxplot
for i, parameter in enumerate(relevant_parameters):
    sns.boxplot(x=df[parameter], ax=axes.flatten()[i], color='blue')
    axes.flatten()[i].set_title(f'Box Plot for {parameter}')
    axes.flatten()[i].set_ylabel(parameter)

# Adjust layout
plt.suptitle('Bancroft weather parameters box plots', fontsize=24)
plt.tight_layout()
plt.show()
plt.savefig('../output/Bancroft weather parameters box plots.png')
<Figure size 2000x1500 with 0 Axes>

Average Temperature (avg_temp): The boxplots display a wide distribution with a central tendency around the median, confirming the seasonal fluctuations already visible in the histograms and time series.

Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dewpoint (avg_dewpoint): The boxplots suggest a similar distribution, with a notable range and central tendency, consistent with the seasonal patterns in the other diagrams.

Average Temperature Change (avg_temp_change): The boxplots indicate that most changes are close to zero, suggesting low daily variability, as also seen in the Dot Plots and time series.

Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): The boxplots show that wind speeds are mostly in the lower range, with some outliers representing the rare higher speeds, as demonstrated in the Dot Plots and histograms.

Maximum Cumulative Precipitation (max_cumulative_precip): The boxplot reveals a concentration of data near zero with some high outliers, confirming the sporadic heavy rains visible in the histograms and time series.

Maximum Snow Density (max_snow_density_6), Maximum Cumulative Snowfall (max_cumulative_snow), Maximum Cumulative Ice Accumulation (max_cumulative_ice): The boxplots illustrate that high values are rare, with many data points clustering near zero, indicating generally low snow and ice accumulation, similar to observations in the Dot Plots and histograms.

Average Pressure Change (avg_pressure_change): The boxplots show a distribution centered around zero with some outliers on both sides, indicating a generally stable atmospheric pressure distribution, consistent with the other types of diagrams.

In the next step the variables are visualized together with their corresponding weather events.

In [13]:
def plot_wep_features_distogram(df, label_col):

    f, axes = plt.subplots(4, 3, constrained_layout = True, figsize=[20,15])
    sns.set(style="darkgrid")
    for i in df[label_col].unique():
        df_cluster_distogram = df.loc[df[label_col] == i]
        axes = axes.ravel()
        num_label = df_cluster_distogram[label_col].value_counts()[i]
        for j, feature in enumerate(relevant_parameters):
            sns.distplot(df_cluster_distogram[feature], label=f'{i}: {num_label}', ax=axes[j])
            
    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Distograms", fontsize=24)
    
    axes[0].set(title="Avg temperature in Kelvin", xlim=(230, 310), xlabel="")
    h, l = axes[0].get_legend_handles_labels()
    axes[1].set(title="Min wet bulb temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[2].set(title="Avg dewpoint in Kelvin", xlim=(230, 310), xlabel="")
    axes[3].set(title="Avg temperature change in Kelvin", xlim=(-1, 1), xlabel="")
    axes[4].set(title="Avg wind speed in m/s", xlim=(0, 10), xlabel="")
    axes[5].set(title="Max windgust in m/s", xlim=(0, 30), xlabel="")
    axes[6].set(title="Avg winddirection in degrees", xlim=(0, 360), xlabel="")
    axes[7].set(title="Max cumulative precipitation in mm", xlim=(0, 100), ylim=(0, 0.1), xlabel="")
    axes[8].set(title="Max snow density (when snow exceeds 6.35mm in kg/m^3)", xlim=(0, 800), ylim=(0, 0.01), xlabel="")
    axes[9].set(title="Max cumulative snow accretion in mm", xlim=(0, 1000), ylim=(0, 0.009), xlabel="")
    axes[10].set(title="Max cumulative ice accretion in mm", xlim=(0, 10), ylim=(0, 0.8), xlabel="")
    axes[11].set(title="Avg pressure change in Pa",  xlim=(-200, 200), xlabel="")  
    f.legend(h, l, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=5)
    

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_distogram(data, 'wep')
plt.savefig('../output/Bancroft_wep_parameter_distograms.png')

Overview These distograms provide a statistical representation of weather data, illustrating the density and distribution of meteorological features categorized by WEPs. Each plot represents a different weather feature, and the overlay of kernel density estimation curves facilitates the visualization of data distribution trends.

Analysis of Distogram Features Temperature Metrics (Average, Minimum Wet Bulb, Dewpoint in Kelvin): These metrics show multimodal distributions, with peaks representing common temperature ranges. The presence of multiple peaks can indicate the occurrence of distinct weather patterns or seasonal variations.

Temperature Change (Average Change in Kelvin): The spread of data around zero indicates frequent temperature stability, but noticeable deviations suggest periods of significant warming or cooling. Skewness in this distogram would indicate a tendency towards one or the other.

Wind Characteristics (Average Speed, Maximum Gust in m/s): Wind speed distograms are skewed towards lower speeds, with long tails extending to higher speeds, indicating that while calm conditions are common, gusty conditions do arise, albeit less frequently.

Wind Direction (Average Direction in Degrees): The relatively flat distribution across degrees indicates no dominant wind direction, suggesting variable wind patterns across the different WEPs.

Precipitation (Maximum Cumulative in mm): The distribution here is right-skewed, indicating that lower precipitation events are more common, with fewer but significant heavy precipitation events.

Snow Metrics (Maximum Density, Cumulative Snow Accretion in mm): These plots reveal a right-skew, especially for snow density, suggesting that while light snowfall is common, events with dense snow accumulation are rare but significant when they occur.

Ice Accumulation (Maximum Cumulative in mm): A similar right-skewed distribution to snow, indicating that ice accumulation is generally low, with occasional extreme icing events.

Pressure Variability (Average Change in Pa): The distribution of pressure changes appears to be approximately normally distributed, indicating that significant pressure drops or spikes are unusual.

Interpretation and Real-World Implications Temperature Metrics: The various modes in temperature distograms may correspond to different seasonal conditions, with the specific peaks corresponding to typical temperatures for those times of year.

Temperature and Pressure Changes: Large swings in temperature and pressure can be associated with frontal systems, which are critical for weather forecasting and can have significant implications for agriculture and energy usage.

Wind Patterns: Understanding the distribution of wind speed and direction is crucial for aviation, marine navigation, and the operation of wind farms.

Precipitation and Snow/Ice Accumulation: These metrics are vital for water resource management, agriculture, and urban planning, especially in anticipating flood risks or the impact of heavy snowfall on infrastructure.

Skewness and Tail Analysis: The presence of skewness and heavy tails in the distograms for precipitation, snow, and ice accumulation suggests the potential for extreme weather events, which are critical for disaster preparedness and risk management.

Concluding Remarks The distograms offer a nuanced view of the probability density of various weather conditions, providing valuable insights for meteorologists, environmental scientists, and policy makers. The observed multimodality, skewness, and the presence of heavy tails in the distributions underline the complexity of weather dynamics and the importance of preparing for a range of weather conditions, from the most common to the extreme. Quantitative analysis, such as calculating skewness, kurtosis, and conducting hypothesis tests, would further enhance the understanding and interpretation of these weather patterns.

The plots show that some of the variables occuring more often at special weather events. For example on days with high temperatures there is often blue sky. At low temperatures, snowstorms with high percipitiation seem to occur. However, it is important to note that, at this point no statements about causality can be made. The graphic only reveals correlations.

Here, the frequencies of various variables are once again depicted. The points are colored according to weather events. This allows for conclusions to be drawn regarding the occurrence of specific aspects during different weather events.

In [14]:
def plot_wep_features_dotplot(df, label_col, relevant_parameters):
    unique_labels = df[label_col].unique()
    palette = sns.color_palette() 
    label_colors = dict(zip(unique_labels, palette))

    f, axes = plt.subplots(4, 3, constrained_layout=True, figsize=[20, 15])
    sns.set(style="darkgrid")

    axes = axes.ravel()
    for i in unique_labels:
        df_cluster_dotplot = df.loc[df[label_col] == i]
        for j, feature in enumerate(relevant_parameters):
            sns.stripplot(x=df_cluster_dotplot[feature], color=label_colors[i], ax=axes[j])

    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Dotplots", fontsize=24)
    patches = [plt.Line2D([0], [0], color=label_colors[label], marker='o', linestyle='', markersize=10) for label in unique_labels]
    f.legend(patches, unique_labels, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=len(unique_labels))

    axes[0].set(title="Avg temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[0].legend(loc="best")
    axes[1].set(title="Min wet bulb temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[2].set(title="Avg dewpoint in Kelvin", xlim=(230, 310), xlabel="")
    axes[3].set(title="Avg temperature change in Kelvin", xlim=(-1, 1), xlabel="")
    axes[4].set(title="Avg wind speed in m/s", xlim=(0, 10), xlabel="")
    axes[5].set(title="Max windgust in m/s", xlim=(0, 30), xlabel="")
    axes[6].set(title="Avg winddirection in degrees", xlim=(0, 360), xlabel="")
    axes[7].set(title="Max cumulative precipitation in mm", xlim=(0, 100), ylim=(0, 0.1), xlabel="")
    axes[8].set(title="Max snow density (when snow exceeds 6.35mm in kg/m^3)", xlim=(0, 800), ylim=(0, 0.01), xlabel="")
    axes[9].set(title="Max cumulative snow accretion in mm", xlim=(0, 1000), ylim=(0, 0.009), xlabel="")
    axes[10].set(title="Max cumulative ice accretion in mm", xlim=(0, 10), ylim=(0, 0.8), xlabel="")
    axes[11].set(title="Avg pressure change in Pa",  xlim=(-200, 200), xlabel="")  

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_dotplot(data, 'wep', relevant_parameters)
plt.savefig('../output/Bancroft_wep_parameter_dotplots.png')
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

Average Temperature: The distribution clearly shows that days with clear skies (blue) tend to have higher temperatures, while snow days (green, red, brown) usually fall in the cooler temperature range. Snowstorms with heavy precipitation also tend to result in lower temperatures, while storms with freezing rain/heavy snow and ice storms tend to have more moderate temperatures. Furthermore, the distribution shows that clear sky days can also have the lowest temperatures, which may be an initial indicator that weather may only be partially related to temperature.

Minimum Wet Bulb Temperature and Average Dew Point: These parameters follow a similar pattern to average temperature, with lower values often associated with snowfall (green, red, brown), and higher temperatures occurring more frequently on clear days (blue).

Average Temperature Change: The slight variation in temperature change across different weather events suggests a relatively stable thermal environment, regardless of weather conditions.

Average Wind Speed and Maximum Wind Gust: Lower wind speeds seem to occur frequently on clear days (blue), while higher speeds appear to be associated with moderate snowfall or more stormy conditions (gray, pink). Wind gusts show a wide dispersion across various weather events, making it difficult to draw conclusions.

Average Wind Direction: The wide dispersion across all directions appears to show no significant correlation with specific weather events, indicating diverse wind patterns.

Maximum Cumulative Precipitation: The lowest values are associated with moderate rain (purple) and then moderate snowfall. Clear sky days (blue) seem to have the highest values but become increasingly rare as the value increases. However, the highest values are only seen on clear sky days.

Maximum Snow Density: Logically, only snow events are considered (red, brown, green, gray, pink). Moderate snowfall accounts for the lowest values. High snow densities are rare and seem to correlate with snowstorms (gray, green), as can be seen from the isolated points in these colors.

Maximum Cumulative Snowfall and Maximum Cumulative Ice Accumulation: These parameters clearly show that days with heavy snowfall (gray) or freezing rain (orange) lead to higher accumulations, while clear days (blue) and light snowfall (red, brown) show little accumulation.

Average Pressure Change: The distribution of pressure changes shows no clear correlation to specific weather events, suggesting that pressure changes might be relatively stable regardless of the type of weather. However, it can be seen that continuous freezing rain (orange), mild snowfall (red), heavy snowfall with accumulated snow (green), and clear sky days (blue) tend to appear on the edges and thus occur in combination with a high average change in pressure.

In [15]:
def plot_wep_features_boxplot(df, label_col, relevant_parameters):
    unique_labels = df[label_col].unique()
    palette = sns.color_palette()
    label_colors = dict(zip(unique_labels, palette))

    f, axes = plt.subplots(4, 3, constrained_layout=True, figsize=[20, 15])
    sns.set(style="darkgrid")

    for j, feature in enumerate(relevant_parameters):
        ax = axes.flatten()[j]
        sns.boxplot(x=df[label_col], y=df[feature], ax=ax, palette=label_colors)

        ax.set_title(f'Box Plot for {feature}')
        ax.set_ylabel(feature)
        ax.set_xlabel('')
        ax.set_xticks([])
        ax.set_xticklabels([])
    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Boxplots", fontsize=24)
    handles = [plt.Line2D([0], [0], color=label_colors[label], marker='o', linestyle='', markersize=10) for label in unique_labels]
    f.legend(handles, unique_labels, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=len(unique_labels))
    plt.tight_layout()
    plt.show()

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_boxplot(data, 'wep', relevant_parameters)
plt.savefig('../output/Bancroft_wep_parameter_boxplots.png')
<Figure size 2000x1500 with 0 Axes>

Comprehensive Weather Analysis Through Boxplot Visualization

The suite of boxplots offers an extensive overview of weather features associated with different WEPs, organized into rows and columns, each depicting a distinct meteorological attribute. These boxplots reveal the extensive variability and presence of outliers across temperature, wind, and precipitation features, suggesting a spectrum of conditions from stable to extreme.

Temperature-Related Features

  • Average Temperature (avg_temp): Demonstrates significant variability across WEPs, with outliers highlighting extreme temperatures.
  • Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dew Point (avg_dewpoint): Display a wide range of values, indicating different moisture and temperature conditions.

Temperature and Pressure Variations

  • Temperature Change (avg_temp_change) and Pressure Change (avg_pressure_change): Show notable numbers of outliers, indicative of high variability possibly due to frontal systems or other dynamic meteorological events.

Wind Features

  • Average Wind Speed (avg_windspd), Maximum Wind Gust (max_windgust), and Average Wind Direction (avg_winddir): Exhibit significant variability, reflecting fluctuating wind conditions with less variability in wind direction, suggesting prevailing patterns.

Precipitation

  • Maximum Cumulative Precipitation (max_cumulative_precip), Snow, and Ice Accumulation: Considerable variation with wide spreads and high outliers particularly in snow and ice boxplots, indicative of intense events.

Snow Density

  • Maximum Snow Density (max_snow_density_6): Displays high variability and outliers, suggesting diverse snowfall types or wet snow events.

Relationships Between Boxplots Complex interactions between various weather conditions are suggested, including potential inverse correlations between temperature and precipitation forms, associations between strong winds and temperature fluctuations, and correlations between higher wind speeds and intense precipitation events.

Insights

  • Clear Skies (Blue): High average temperatures with low precipitation, typical of clear conditions.
  • Continuous Freezing Rain (Orange): Tight temperature ranges and variable pressure changes indicative of freezing rain conditions.
  • Light to Heavy Snowfall (Red to Green): Low temperatures with moderate to high snow and ice accumulations, typical of varying snowfall intensities.
  • Moderate Rain and Snow (Purple and Brown): Demonstrates variability in temperature changes and precipitation amounts.
  • Severe Storms (Pink and Grey): Marked by extreme precipitation and snow amounts, significant temperature and pressure fluctuations, and variable wind conditions.

In the analysis of the present assignment, temperature and wind are of particular importance. These variables will be examined more closely.

4.2.3. Further Examination of WEP and Temperature Response Variables¶

In [16]:
# If 'Blue sky day' is present in 'wep', use 'label0'; otherwise, use 'wep'
df['display_label'] = df.apply(lambda row: f"{row['wep']}: '{row['label0']}'" if row['wep'] == 'Blue sky day' else row['wep'], axis=1)

# Count occurrences of weather events
weather_counts = df['display_label'].value_counts()

# Count occurrences of weather events excluding 'Blue sky day'
not_blue_sky_day_counts = df[df['wep'] != 'Blue sky day']['wep'].value_counts()

# Determine events to be exploded
explode_events = set(not_blue_sky_day_counts.index)

# Create explode list for the first chart
explode = [0.1 if event in explode_events else 0 for event in weather_counts.index]

# Total number of all samples
total_samples = len(df)

# Create the plot
plt.figure(figsize=(16, 8))

# Group categories and adjust labels
lower_half_threshold = weather_counts.quantile(0.5)
grouped_labels = [label if weather_counts[label] > lower_half_threshold else 'Other' for label in weather_counts.index]

# Pie chart for all weather events
plt.subplot(1, 2, 1)
patches, texts, autotexts = plt.pie(weather_counts, labels=grouped_labels, autopct=lambda p: f'{p:.1f}%\n({int(p/100*total_samples)})', startangle=90, explode=explode)
plt.title(f'Distribution of All Weather Events\nTotal Number of Observations: {total_samples}')

# Adjust text size and hide overlapping labels
for text, autotext in zip(texts, autotexts):
    if text.get_text() == 'Other':
        text.set_visible(False)
    text.set_fontsize(8)
    autotext.set_fontsize(8)

# Enlarged view for 'wep' != 'Blue sky day'
plt.subplot(1, 2, 2)
total_not_blue_samples = not_blue_sky_day_counts.sum()
plt.pie(not_blue_sky_day_counts, labels=not_blue_sky_day_counts.index, autopct=lambda p: f'{p:.1f}%\n({int(p/100*total_not_blue_samples)})', startangle=90)
plt.title(f'Distribution of Extreme Weather Events\nTotal Number of Non-Blue-Sky Observations: {total_not_blue_samples}')
plt.show()
plt.savefig('../output/Distribution of weather events pie chart.png')
<Figure size 2000x1500 with 0 Axes>

Weather Condition Frequencies¶

  • Blue Sky Day: The most common weather condition, with a total of 52,633 days recorded. This indicates clear weather without significant cloud cover, snow, ice, high precipitation, unusual temperatures or adverse weather events. The Blue Sky Days are to be understood as a "common or regular weather day".
  • Mild Snowfall: Occurred on 4,497 days, representing light snow that did not result in significant accumulation or disruption
  • Moderate Snowfall: Recorded on 2,920 days, representing a greater intensity of snowfall that caused some accumulation.
  • Moderate Rain: Seen on 2,630 days, indicative of rainfall that is neither too light nor excessively heavy.
  • Heavy Snowfall with Accumulated Snow: This condition was noted on 1,040 days, characterized by substantial snowfall leading to significant snow accumulation.
  • Frontdurchlauf / Continuous Freezing Rain: Occurred on 934 days. The term 'Frontdurchlauf' suggests a weather front causing persistent freezing rain.
  • Storm with Freezing Rain / Heavy Snow- and Ice Storm: A severe weather condition that happened on 430 days, involving a combination of freezing rain and heavy snow or ice, potentially leading to hazardous conditions.
  • Snowstorm with High Precipitation: The least frequent, observed on 261 days, indicating snowstorms with significant snowfall amounts.

The data shows a clear predominance of weather events, with 'Blue sky day' being the majority. However, there is a notable frequency of various forms of precipitation, particularly snowfall, which can have implications for activities such as transportation, agriculture, and city planning. The variance in snow-related conditions from mild to heavy suggests a range of winter weather that would require different levels of response from municipal services.

Overall Weather Event Distribution:¶

The first chart details the proportion of different weather events and consists of 65.345 observations. Most observations, about 31.6%, have been classified under a category "Blue sky day: 0", suggesting a predominant weather type in the dataset. 'Blue sky day' occurrences are differentiated into five sub-categories ('0', '1', '2', '3', '4'), which appear with a huge difference in their frequency. Together, they account for more than 77.9 percent of all observations, indicating that clear skies are a common condition. The remaining weather events such as moderate rain, mild and moderate snowfall, and other less frequent events are highlighted, since they show the main part of this EDA.

Extreme Weather Event Distribution:¶

The second chart focuses on the distribution of extreme weather events, excluding the 'Blue sky day'. This subset consists of 12.712 observations, which is approximately 19.5% of the total dataset, indicating that extreme weather events are relatively less common. Within this subset, moderate snowfall is the most common, making up 35.4% of non-clear sky observations. Moderate snowfall (23.0 %) and moderate rain (20.7 %) are the next most common events. Those three accumulated make up to 79.1 percent, the remaining 20.9 percent are the more extreme events. The more extreme events include:

  1. Heavy snowfall with accumulated snow
  2. Frontdurchlauf / Continuous freezing rain
  3. Storm with freezing rain / Heavy snow- and icestorm
  4. Snowstorm with high precipitation,
  5. </ol> </br> sorted by their frequency.
    These events are categorized as extreme, because they can significantly impact the environment and human activities.

In [17]:
df_datetime = df_full.copy()

# Grouping the data by year and month and calculating the average in Celsius
grouped_monthly_avg_celsius = df_full.groupby([df_full['run_datetime'].dt.year, df_full['run_datetime'].dt.month])['avg_temp_celsius'].mean()

# Conversion of the grouped data into a format suitable for the radar type
radar_data_celsius = grouped_monthly_avg_celsius.unstack(level=1)

# Selecting the color chart
cmap = plt.cm.coolwarm

# Determining the temperatures in July and scaling for the color chart
july_temps = radar_data_celsius.loc[:, 7]  # 7 steht für Juli
min_july_temp = july_temps.min()
max_july_temp = july_temps.max()
scaled_july_temps = (july_temps - min_july_temp) / (max_july_temp - min_july_temp)
colors = [cmap(temp) for temp in scaled_july_temps]  # Liste der Farben

# Creating the radar chart with Celsius data
angles_celsius = [n / 12 * 2 * math.pi for n in range(12)]  # 12 Monate
angles_celsius += angles_celsius[:1]  # Schließt den Kreis

# Month names
month_names = ["Jan", "Feb", "Mär", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"]

# Creating the radar chart
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
for i, year in enumerate(radar_data_celsius.index):
    values_celsius = radar_data_celsius.loc[year].tolist()
    values_celsius += values_celsius[:1]  # Closing the circle
    ax.plot(angles_celsius, values_celsius, linewidth=1, linestyle='solid', label=f'Year {year}', color=colors[i])

# Add month names as labels
plt.xticks(angles_celsius[:-1], month_names, color='grey', size=12)

plt.title('Temperature change over the months of all years in the dataset (in Celsius)', size=20, y=1.1)
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
plt.show()
plt.savefig('../output/Temperature change radar chart.png')
<Figure size 2000x1500 with 0 Axes>

The radar chart visualizes the average temperature data derived from the ERA5 weather dataset for Bancroft over the period 2016 to 2022. Each of the 12 axis represents a month, with temperature change values plotted along the axes. The average temperature was converted from Kelvin to Celsius for better visual representation.

Techniques Utilized:

  • Data Aggregation: Monthly averages provide a simplified yet informative view of temperature trends, smoothing out daily variability and focusing on larger patterns.
  • Normalization: By normalizing the data around the mean, the plot effectively highlights deviations, which is crucial for identifying months that diverge from the expected seasonal patterns.

Key Insights from the Radar Chart:

  • Seasonal Trends: The consistency of the patterns across different years confirms the cyclical nature of Bancroft's climate with clear distinctions between the warmer and colder months.
  • Yearly Fluctuations: Variations in the shape of the lines for each year can indicate minor shifts in climate patterns, such as an overall increase or decrease in average temperatures which might be tied to broader trends like global warming. The plot shows no consistent increase in temperature over the years, which does not mean that there is no global warming effect.
  • Extreme Weather Events: Sharp deviations from the mean in specific months could signal extreme weather events. For example, a particularly sharp peak in one year could indicate a heatwave, while a significant dip might correspond to an unusually cold month. The plot shows a few heat- and cold waves across different years, further analysis could help to understand this fluctuations.
  • Climatic Anomalies: Comparing the temperature profiles of consecutive years can reveal climatic anomalies. A year that consistently shows higher temperatures across all months could be indicative of a long-term warming trend. With the vizualisation alone it is hart to tell, if there is a year which has constantly higher oder lower temperatures. Further analysis could reveal more.
In [18]:
weather_counts = df['wep'].value_counts()
print(weather_counts)
wep
Blue sky day                                           52633
Mild snowfall                                           4497
Moderate snowfall                                       2920
Moderate rain                                           2630
Heavy snowfall with accumulated snow                    1040
Frondurchlauf / Continuous freezing rain                 934
Storm with freezing rain / Heavy snow- and icestorm      430
Snowstorm with high precipitation                        261
Name: count, dtype: int64
In [19]:
# Convert the 'run_datetime' column to a datetime object
df['run_datetime'] = pd.to_datetime(df['run_datetime'])

# Aggregate the average wind speed and temperature for each month across all years
avg_by_month = df.groupby(df['run_datetime'].dt.to_period('M')).agg({
    'avg_windspd': 'mean',
    'avg_temp': 'mean'
}).reset_index()

# Ensure the presence of a date column
avg_by_month['Month'] = avg_by_month['run_datetime'].dt.strftime('%b')

# Adjust the order of the months
months_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_by_month['Month'] = pd.Categorical(avg_by_month['Month'], categories=months_order, ordered=True)
avg_by_month = avg_by_month.sort_values('Month')

# Plot as a Line Chart
fig, ax1 = plt.subplots(figsize=(15, 8))

# Line for wind speed on the left y-axis
color = 'tab:blue'
ax1.set_xlabel('Month')
ax1.set_ylabel('Average windspeed (m/s)', color=color)
ax1 = sns.lineplot(x='Month', y='avg_windspd', data=avg_by_month, sort=False, label='Wind Speed', color=color)
ax1.tick_params(axis='y', labelcolor=color)

# Second y-axis for temperature
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Average temperature (°C)', color=color)
ax2 = sns.lineplot(x='Month', y='avg_temp', data=avg_by_month, sort=False, label='Temperature', color=color)
ax2.tick_params(axis='y', labelcolor=color)

plt.title('Average Wind Speed and Temperature per Month Across All Years')

# Adding the legend manually
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
plt.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper right')

plt.show()
plt.savefig('../output/verage wind speed and temperature per month across all years.png')
<Figure size 2000x1500 with 0 Axes>

This time series analysis provides insightful revelations into the climate of Bancroft, examining the average monthly temperature and wind speed.

Key Findings Average Temperature

  • Seasonal Dependence: Clear pattern with highs in summer and lows in winter.
  • Annual Cycle: Temperatures rise from January, peaking in July, then decreasing until December. Average Wind Speed
  • Seasonal Variance: Increases in spring, peaks in summer, and decreases in winter.
  • Meteorological Implications: Highest speeds in July, indicating intensified activities. Correlation Between Temperature and Wind Speed
  • General Trends: Increase in wind speed during summer correlates with higher temperatures.
  • Thermal Effects: Possibly due to increased land warming during summer months.

Detailed Analysis Temperature and Wind Speed Intersections

  • Transitional Phases: Indicated by the intersection of average wind speed and temperature curves.
  • Seasonal Indicators: These intersections serve as crucial markers for the start or end of seasons. Wind Speed (Blue Line):
    • Peaks: Notably in March (windiest) and November.
    • Troughs: Lowest average wind speed in August. Temperature (Red Line):
    • Annual Cycle: Peaks in July or August, demonstrating a clear sinusoidal pattern.
    • Inverse Relationship: Notably from May through August with temperature peaks and wind speed troughs.

Interpretation and Conclusions

  1. Seasonal Patterns: Peaks in wind speed may relate to specific weather patterns like monsoons or trade winds.
  2. Inverse Relationship: Evident during certain months, suggesting complex interactions between temperature and wind.
  3. Synergy Towards Year's End: Both wind speed and temperature decline together, indicating onset of winter.

This analysis already brings us to the analysis of relationships between variables, which will be disgussed in the following chapter.

4.3 Relationships¶

In [20]:
# Display a scatter plot in a diagram with subplots for all parameter pairs
fig, axes = plt.subplots(len(relevant_parameters), len(relevant_parameters), figsize=(50, 50), gridspec_kw={'hspace': 0.8, 'wspace': 0.4})

# Function for calculating the correlation coefficient and displaying the scatterplot
def plot_scatter_and_correlation(ax, x, y, data):
    sns.scatterplot(x=x, y=y, data=data, ax=ax)
    correlation_coefficient = data[x].corr(data[y])
    ax.set_title(f'Scatterplot of {x} and {y}', fontsize=16)
    ax.set_xlabel(x, fontsize=14)
    ax.set_ylabel(y, fontsize=14)
    ax.text(0.5, -0.27, f'Correlation Coefficient: {correlation_coefficient:.2f}', transform=ax.transAxes, ha='center', fontsize=12)

# Loop over all parameters 
for i, feature_x in enumerate(relevant_parameters):
    for j, feature_y in enumerate(relevant_parameters):
        if i != j:  # Damit die Diagonale übersprungen wird (x gegen x)
            plot_scatter_and_correlation(axes[i, j], feature_x, feature_y, df)

plt.tight_layout()
plt.show()
plt.savefig('../output/Bancroft Association Plots.png')
<Figure size 2000x1500 with 0 Axes>

Exploring Parameter Associations and Correlations This chapter delves into the significance of association and correlation analyses, the methodologies employed, and the nuanced interpretation of results. Additionally, it addresses the rationale behind incorporating temperature and wind parameters in regression analyses, despite their lack of direct correlation.

Significance of Association and Correlation Association and correlation analyses are fundamental tools in uncovering relationships between different variables within a dataset. These analyses provide insights into the strength and direction of connections, helping identify patterns, dependencies, and potential factors influencing the phenomena under study.

To assess the associations and correlations among meteorological parameters, scatterplots were plotted and the correlation coefficients calculated. The scatterplots visually display the relationship between two variables, offering an initial understanding of their potential connection. Subsequently, correlation coefficients (Pearson correlation coefficient) quantify the strength and direction of the linear relationship between variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

Comparing Parameter Correlations In the exploration, the correlation between all pairs of meteorological parameters were systematically compared. This exhaustive approach aimed to uncover not only direct associations but also potential indirect relationships and dependencies that might influence weather patterns.

A noteworthy observation from the analyses is that variables with similar scales tend to exhibit stronger correlations (e.g., avg_temp, avg_dewpoint, avg_temp_change, min_wet_bulb_temp as well as avg_windspd and max_windgust). This finding underscores the importance of considering the units and magnitudes of variables when interpreting correlation coefficients. Variables measured on comparable scales are more likely to show coherent patterns and relationships. An intriguing aspect of the analysis was the dynamic nature of correlations, particularly in response to extreme weather events. The occurrence of extreme weather events was found to alter the correlation between variables, suggesting that these events could act as influential factors in shaping the relationships among meteorological parameters.

Despite not exhibiting a direct correlation, temperature and wind parameters were intentionally included in the following regression analyses. The decision stemmed from the recognition that their relationship might be nonlinear or influenced by additional factors not captured by linear correlation. The exploration of multiple predictor regression analyses sought to unveil nuanced interactions among these parameters, going beyond simple linear associations. The inclusion of temperature and wind parameters in multiple predictor regression analyses aimed to unravel complex dependencies and potential interactions that might contribute to temperature variations. While standalone correlation might not capture these intricate relationships, the multiple predictor regression framework allows for a more holistic understanding of the combined impact of various factors on the target variable.

Conclusion In conclusion, the association and correlation analyses provided valuable insights into the interconnectedness of meteorological parameters. The consideration of variable scales, the dynamic influence of extreme weather events, and the rationale for including seemingly uncorrelated variables in regression analyses contribute to a more nuanced understanding of atmospheric dynamics in Bancroft. This chapter serves as a foundation for the subsequent exploration of multiple predictor regression analyses, where the goal is to decipher the complex interplay among meteorological parameters and their collective impact on weather patterns.

This analysis should also be conducted seperately for all different weps.

In [21]:
def plot_wind_speed_and_avg_temp_by_direction(df_full, df, wind_direction_label, avg_temp_label, avg_windspd_label):
    # Calculate the average temperature for each wind direction
    avg_temp_by_direction = df_full.groupby(wind_direction_label)[avg_temp_label].mean().reset_index()

    # Order cardinal directions clockwise
    directions_ordered = ['North', 'Northeast', 'East', 'Southeast', 'South', 'Southwest', 'West', 'Northwest']
    avg_temp_by_direction[wind_direction_label] = pd.Categorical(avg_temp_by_direction[wind_direction_label], categories=directions_ordered, ordered=True)
    avg_temp_by_direction = avg_temp_by_direction.sort_values(wind_direction_label)

    # Color scale for temperature
    color_scale = plt.cm.get_cmap('coolwarm')

    # Create the bar chart
    fig, ax = plt.subplots(figsize=(8, 6))

    # Loop through the data and create bars with corresponding color
    for i, row in avg_temp_by_direction.iterrows():
        color = color_scale(row[avg_temp_label] / avg_temp_by_direction[avg_temp_label].max())
        wind_direction = row[wind_direction_label]
        average_wind_speed = df[df[wind_direction_label] == wind_direction][avg_windspd_label].mean()
        ax.bar(wind_direction, average_wind_speed, color=color, edgecolor='black')

    # Chart settings
    ax.set_xlabel('Wind Direction')
    ax.set_ylabel('Wind Speed (m/s)')
    ax.set_title('Wind Speeds and Average Temperatures by Wind Direction')

    # Create a separate Axes for the colorbar
    cax = fig.add_axes([0.95, 0.1, 0.04, 0.79])

    # Add color legend
    temperature_min = avg_temp_by_direction[avg_temp_label].min()
    temperature_max = avg_temp_by_direction[avg_temp_label].max()
    sm = plt.cm.ScalarMappable(cmap=color_scale, norm=plt.Normalize(vmin=temperature_min, vmax=temperature_max))
    sm.set_array([])  # empty data for the color legend
    cbar = plt.colorbar(sm, cax=cax, orientation='vertical')
    cbar.set_label('Average Temperature (°C)')

    # Show the chart
    plt.show()

# Beispielaufruf der Funktion
plot_wind_speed_and_avg_temp_by_direction(df_full, df, 'wind_direction_label', 'avg_temp_celsius', 'avg_windspd')
plt.savefig('../output/Bancroft Wind Speeds and Average Temperatures by Wind Direction.png')
<Figure size 2000x1500 with 0 Axes>

Wind Speed The bars height represent the average wind speed for each wind direction. The highest wind speeds are observed from northern and western directions. The lowest wind speeds occur with southeastern and eastern winds.

Average Temperature The color of the bars represents the average temperature, with warmer temperatures in darker shades of red and cooler temperatures in darker shades of blue. Winds from the southwest bring the highest temperatures. Northern winds are associated with the lowest temperatures.

Interpretation Wind speed varies with wind direction. Northern and western winds bring higher wind speeds, which may indicate stronger wind events or a general tendency for higher wind speeds from these directions. The color coding of the bars shows that southwestern winds tend to correlate with warmer temperatures, while northern winds bring cooler temperatures.

This suggests the origin of the air masses: warmer air masses from the southwest and colder ones from the north. The higher wind speeds from northern and western directions might indicate more frequent and stronger wind events in these directions. Warmer temperatures with southwestern winds could suggest warmer air masses being transported from these directions. The cooler temperatures with northern winds might reflect colder air masses originating from polar regions.

Conclusions The data show that wind direction has a significant impact on wind speed and temperature. These findings can be important for weather forecasting and sectors like energy production, where wind energy and temperature management are crucial factors. Further analyses might focus on how these patterns vary over time and what impact they may have on local climate conditions. Concluding from this analysis, it seems like wind and temperature are correlated or even having a causal connection, as southern winds are milder and warmer, while northern winds are stronger and colder. This hypotheses is to be evaluated in the further progress of this work.

4.4 Plotting Data in a 3D space¶

Building upon the comprehensive analysis of weather events and their directional influences, a transition is now made to a more abstract yet insightfuThe multi-dimensional weather data will be condensed into three principal components, providing a visual exploration of the intrinsic structure and variability of the data. By plotting this 3D PCA scatter plot, hidden patterns, clusters, or anomalies across the weather events are anticipated to be uncovered. This approach aims to offer a more profound understanding of the complex interplay between various meteorological elements. realm of data visualization.— a 3D space facilitated by Principal Component Analysis (PCA). This advanced method is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information. Let's venture into this 3D representation to discern the subtle yet critical aspects of weather dynamics that might elude simpler analyses.

In [22]:
import plotly.express as px
from sklearn.decomposition import PCA

feature_columns = df.columns
columns_to_exclude = ['wep', 'run_datetime', 'avg_temp_celsius', 'avg_winddir', 'wind_direction_label', 'label0', 'label1', 'label2', 'year', 'month', 'display_label']
feature_columns = [col for col in feature_columns if col not in columns_to_exclude]

X = df[feature_columns].dropna()
X_standardized = StandardScaler().fit_transform(X)

# Apply PCA to reduce dimensionality to 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_standardized)

# Create a DataFrame with PCA results and 'wep' label
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2', 'PC3'])
pca_df['wep'] = df.loc[X.index, 'wep']

# Create an interactive 3D scatter plot using Plotly Express with smaller dots
fig = px.scatter_3d(pca_df, x='PC1', y='PC2', z='PC3', color='wep', symbol='wep', size_max=2, opacity=0.7,
                    labels={'wep': 'Weather'}, title='Interactive 3D PCA Plot of Weather Data')

# Update layout for better visualization
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

# Update marker size for better visibility
fig.update_traces(marker=dict(size=2),
                  selector=dict(mode='markers'))

# Increase marker size in the legend
fig.update_layout(legend=dict(
    itemsizing='constant'
))

# Show the plot
fig.show()

The PCA analysis of the weather dataset has provided a detailed 3D model, revealing a non-uniform distribution of data points with a pronounced cone-like shape spreading primarily along the PC1 and PC2 axes, indicating that these components capture the majority of variability within the weather conditions. The range of PC1, extending from -5 to 15, and PC2, from -5 to 17.5, underscores the significant spread of certain weather events. Notably, 'Storm with freezing rain / Heavy snow- and icestorm' demonstrates an exceptional reach beyond 10 on PC1 and over 15 on PC2, marking it as an outlier, while 'Blue sky day' is distinguished by its extension below 0 on PC1 and -5 to 10 on PC2. These conditions, along with 'Mild snowfall', which also appears below 0 on PC2, highlight the wide variation in meteorological features and their considerable impact on the PCA model.

The dense clustering around the origin suggests that the most common weather patterns share a high degree of overlap in key features such as average temperature and precipitation. This dense core, contrasted with the extended tail and outliers in the PCA plots, suggests a gradient from frequent to rare and extreme weather events, with the latter possibly driven by extreme values in features like wind speeds or temperature changes.

However, interpreting this 3D PCA model is not without challenges. Separating clusters, especially within PC3, which has a smaller range from -8 to 8, is difficult due to the subtlety that this component captures, potentially related to less dominant weather features. The considerable overlap of conditions within PC1 further complicates this task, demonstrating the complexity of distinguishing between different weather patterns.

The distinct spread and visibility of events such as 'Storm with freezing rain / Heavy snow- and icestorm' and 'Blue sky day' at first glance provide valuable insights into the extremities of the weather dataset. This visibility serves as an essential starting point for further investigation, possibly guiding targeted research into the factors contributing to these extreme phenomena.

5 Model¶

First, an overview of various regression and classification algorithms is given. Then three different algorithms will be evaluated for each research question in order to investigate their quality in relation to the research question.

5.1 Overview Classification and Regression Algorithms¶

In general, different types of algorithms have varying use cases, thus necessitating consideration of the application scenarios.

In weather data analysis, the choice between classification and regression depends on the nature of the target variable. Classification is better suited for predicting categorical outcomes, such as weather events. The distinct boundaries between these events and the intuitive interpretation of classification results make it particularly suitable for this purpose. Classification is also adept at handling multiple classes and imbalanced data, common characteristics in weather event datasets. The actionable insights provided by classification outcomes are valuable in decision-making scenarios.

On the other hand, regression is more appropriate for predicting continuous variables, such as temperature. Temperature changes often occur gradually and exhibit a numerical gradient, making regression well-equipped to capture these subtle variations. Regression models are effective in scenarios where precise numerical predictions are essential, providing a quantitative representation of temperature values. Additionally, regression models, particularly those designed for time-series analysis, can account for temporal patterns commonly found in temperature data.

In summary, while classification excels in providing clear categorizations of weather events, regression is better suited for the nuanced prediction of continuous variables like temperature, offering precise numerical estimations and accommodating subtle variations over time or space.

  1. Linear Regression: Linear Regression Models describe a linear relationship between a dependent and on or more independent variables. The least squared method is used to fit a straight line to represent the response variable [^24].

  2. Logistic Regression: Logistic Regression models are used for classification. They offer simplicity and interpretability, which make them ideal for straightforward binary classification tasks. They perform well in scenarios with smaller datasets and maintain efficiency while being less prone to overfitting. Their effectiveness diminishes when faced with complex, non-linear relationships within the data [^25].

  3. Decision Trees and Random Forests: Decision Trees or Regression Trees and Random Forests as an enhancement can both be used for regression and classification. They can be useful for use-cases with non-linear relationships. They are robust to outliers, but prone to overfitting [^26]. Decision trees are useful for non-linear relationships, with sets of easy definable rules (if a and b happens, but c is not given, d will happen). Decision trees prone to overfitting and are therefore not well-suited for very complex use cases [^27]. Random Forests, an asemble of multiple decision trees, are easier to overcome overfitting and parameters can be set easily. Also, the variable importance and accuracy is generated automatically. Random Forests are therfore better suited for modelling high dimensional data [^28].

  4. Gradient Boosting (e.g., XGBoost, LightGBM): Gradient Boosting uses an ensemble of machine learning models and can reach a high accuracy. An initital Model is fitted and on top of that another model precits the accuracy of the first model. This process is called boosting and leads to a highly accurate model, which is very robust to outliers. The disadvantages are, that they are computionally expensive and the models need hyperparamter tuning [^29].

  5. Support Vector Regression (SVR) / Support Vector Machines (SVM):
    Support Vector Regression (SVR) and Support Vector Machines (SVM) excel when there is a clear separation between classes. They are effective in high-dimensional spaces, and perform well when dimensions outnumber samples while beeing also relatively memory-efficient. Handling large datasets and performing when faced with noisy data or overlapping target classes are weeknesses of SVMs. Additionally, if the number of features surpasses the number of training data samples, SVM may underperform [^30].

  6. K-Nearest Neighbors (KNN):
    The classification algorithm K-Nearest Neighbors (KNN) is especially known for their easy implementation method and not needing a training period, which makes them a suitable choice for straightforward tasks. Sensitive to noisy data, KNN's performance is impaired by large datasets with high dimensionality [^31].

  7. Naive Bayes:
    Naive Bayes is referred as the most efficient and effective clasification algorithm [^32]. Praised for its simplicity and computational efficiency, making it a quick choice for certain applications. It handles high-dimensional data effectively but assumes feature independence, impacting its performance with correlated features.

  8. Neural Networks (Deep Learning):
    Neural Networks, are adept at capturing intricate patterns in data, making them well-suited for complex tasks. They are well suited for handling large and intricate datasets but come with the trade-offs of requiring substantial amounts of data and being computationally expensive. The black-box nature of neural networks can also limit their interpretability. However, the application of Neural Networks is out of scope for this analysis[^33].


    5.2 Potential Challenges (Feedback Project Proposal)¶

    Predicting weather-related data, such as the ERA5 dataset, poses various challenges due to the complex and dynamic nature of atmospheric phenomena.

    One key challenge lies in the inherent temporal dependencies within the data. Time series data, such as weather data over a period of time often exhibits temporal dependencies, where observations at one point are correlated with those at nearby points. A strategy for solving this challenge could be incorporating lagged variables as features in order to capture temporal patterns. By utilizing past values of relevant variables at specific time intervals, the model can effectively capture temporal patterns and dependencies, enhancing its predictive capabilities.

    Another relevant aspect is, that the dataset has seasonality and trends that affect the target variable. Weather patterns often demonstrate cyclic variations and underlying trends that can significantly impact the target variable. To mitigate this, detrending and deseasonalizing techniques become imperative. Through methods such as differencing or seasonal decomposition, the data can be processed to emphasize the underlying patterns while minimizing the influence of seasonal fluctuations, thus allowing the model to focus on essential predictive features.

    Furthermore, the imbalanced distribution of weather classes within the dataset poses a practical challenge for classification tasks. The occurrence of certain weather conditions are naturally more infrequent than others, leading to imbalances in the classes. Addressing this challenge involves employing techniques like oversampling, undersampling, or synthetic data generation to rebalance the classes. Tuning classification algorithms to account for class imbalances is crucial for ensuring accurate predictions across all weather categories. By systematically addressing these challenges, a robust framework for predicting weather-related variables in the ERA5 dataset can be established, contributing to more accurate and reliable forecasting models.


    In [88]:
    #import dependencies for models (import2)
    from sklearn import svm
    
    from datetime import datetime
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
    
    from sklearn.linear_model import LinearRegression, LogisticRegression, SGDRegressor
    
    from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier
    
    from sklearn.tree import DecisionTreeClassifier
    
    from sklearn.neighbors import KNeighborsClassifier
    
    from sklearn.svm import SVR
    
    from sklearn.pipeline import make_pipeline
    
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    
    import statsmodels.api as sm
    
    import pickle
    

    5.3. Regression Analysis:¶

    5.3.1. Regression Analysis Temperature and Wind Modeling:¶

    Is it possible to find a correlation or causation between the temperature and windspeed, windgust or winddirection using regression techniques? This involves investigating the impact of independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather.

    The correlation or causation between the temperature and windspeed, windgust or winddirection will be examined through regression analysis. This involves investigating the impact of independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather.

    5.3.1.1 Regression Analysis Temperature and Wind Modeling: Select Model¶

    The first step is to instantiate the selected models. A variety of regression techniques will be used including Linear Regression, Gradient Boosting Regressor, Stochastic Gradient Descent Regressor, and Support Vector Regressor. Each model uses a unique approach to learn the relationship between temperature and wind characteristics.

    In [24]:
    # Create Linear Regression model
    lr_model_1 = LinearRegression()
    
    # Create Gradient Boosting Regressor
    gb_model_1 = GradientBoostingRegressor(n_estimators=100, random_state=42)
    
    # Create Stochastic Gradient Descent Regressor
    sgd_model_1 = SGDRegressor(max_iter=1000, random_state=42)
    
    # Create Support Vector Regressor
    svr_model_1 = SVR()
    

    5.3.1.2 Regression Analysis Temperature and Wind Modeling: Training and validation¶

    After the training models are instantiated, the next step is to prepare the dataset for training and validation. This involves selecting the features (independent variables) and the dependent variable (temperature), and then splitting the dataset into training and test sets. This process ensures that the models can learn from a portion of the data (training set) and then be validated against an unseen portion (test set) to assess their predictive performance and generalization capability.

    In [25]:
    # Select the features (independent variable)
    X = df['avg_windspd'].values.reshape(-1, 1)
    
    # Select the dependent variable
    y = df['avg_temp']
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Print the shapes of the sets
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
    
    X_train shape: (52276, 1)
    X_test shape: (13069, 1)
    y_train shape: (52276,)
    y_test shape: (13069,)
    

    After selecting the features (average wind speed) as the independent variable and temperature as the dependent variable, the dataset is divided into training and testing sets using a 80-20 split. This means 80% of the data will be used for training the model, and the remaining 20% will be used for testing. The shapes of the resulting sets are as follows:

    • X_train (features for training) contains 52,276 samples, each with 1 feature (average wind speed).
    • X_test (features for testing) contains 13,069 samples, again each with 1 feature.
    • y_train (temperature for training) contains 52,276 samples corresponding to the temperatures.
    • y_test (temperature for testing) contains 13,069 samples corresponding to the temperatures.

    5.3.1.3 Regression Analysis Temperature and Wind Modeling: Fit model¶

    The next step is the fitting, which is the actual training of the models on the prepared data. Each model is fitted separately using the training set.

    In [26]:
    # fit Linear Regression model
    lr_model_fit_1 = lr_model_1.fit(X_train, y_train)
    
    # fit Gradient Boosting Regressor
    gb_model_fit_1 = gb_model_1.fit(X_train, y_train)
    
    # fit Stochastic Gradient Descent Regressor
    sgd_model_fit_1 = sgd_model_1.fit(X_train, y_train)
    
    # fit Support Vector Regressor
    svr_model_fit_1 = svr_model_1.fit(X_train, y_train)
    

    Each model is now trained and ready to make predictions. The subsequent step is typically the evaluation, which involves testing the models on unseen data to assess their performance. This step is crucial in determining how well each model generalizes to new data, which is vital for real-world applications. It's also important to compare the models against each other to understand their respective strengths and weaknesses, ensuring the selection of the most appropriate model for predicting temperature based on wind characteristics. This comparative analysis will guide the final decision on which model to deploy for effective and reliable predictions.

    Next, a method for evaluation is defined, and the models are then assessed based on this approach.

    5.3.1.4 Regression Analysis Temperature and Wind Modeling: Evaluation on test set¶

    In [27]:
    # Function to evaluate and print model performance
    def evaluate_model(model_name, model, X_test, y_test):
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
        print(f'{model_name} Model:')
        print(f'Mean Squared Error: {mse:.2f}')
        print(f'Mean Absolute Error: {mae:.2f}')
        print()
    
    # Evaluate Linear Regression model
    evaluate_model('Linear Regression', lr_model_fit_1, X_test, y_test)
    
    # Evaluate Gradient Boosting model
    evaluate_model('Gradient Boosting', gb_model_fit_1, X_test, y_test)
    
    # Evaluate Stochastic Gradient Descent model
    evaluate_model('Stochastic Gradient Descent', sgd_model_fit_1, X_test, y_test)
    
    # Evaluate Support Vector Regression model
    evaluate_model('Support Vector Regression', svr_model_fit_1, X_test, y_test)
    
    Linear Regression Model:
    Mean Squared Error: 127.78
    Mean Absolute Error: 9.63
    
    Gradient Boosting Model:
    Mean Squared Error: 127.72
    Mean Absolute Error: 9.63
    
    Stochastic Gradient Descent Model:
    Mean Squared Error: 127.78
    Mean Absolute Error: 9.63
    
    Support Vector Regression Model:
    Mean Squared Error: 129.85
    Mean Absolute Error: 9.57
    
    

    Model Evaluation Results¶

    The models have been evaluated on the test set using two metrics: Mean Squared Error (MSE) and Mean Absolute Error (MAE). These metrics provide a quantitative measure of each model's accuracy in predicting temperature based on wind speed. Lower values indicate better performance. Here are the results:

    The results indicate that while all models perform similarly, there are slight variations in their accuracy. This nuanced understanding of each model's performance will guide the final decision on the most appropriate model for deployment.

    #### 5.3.1.5 Regression Analysis Temperature and Wind Modeling: Visualize results

    Transitioning from model evaluation to visualization is crucial, as it provides an intuitive understanding of model performance. Visualization facilitates the immediate identification of patterns, outliers, and the overall fit of the models through graphical representation of actual versus predicted values and the distribution of residuals.

    In [28]:
    from scipy.stats import zscore
    
    # Function to create subplots for model visualization and outlier analysis
    def visualize_model(model_name, model, X_test, y_test, subplot_idx):
        y_pred = model.predict(X_test)
    
        # Calculate residuals
        residuals = y_test - y_pred
    
        # Calculate MSE and MAE
        mse = mean_squared_error(y_test, y_pred)
        mae = mean_absolute_error(y_test, y_pred)
    
        # Detect outliers using Z-scores
        z_scores = zscore(residuals)
        outliers = np.abs(z_scores) > 2  # Threshold for considering a point as an outlier
    
        # Create subplots for actual vs predicted values
        plt.subplot(2, 4, subplot_idx)
        plt.scatter(X_test[outliers], y_test[outliers], color='orange', label='Outliers')
        plt.scatter(X_test[~outliers], y_test[~outliers], color='blue', label='Actual')
        plt.scatter(X_test, y_pred, color='red', label='Predicted', alpha=0.7)
        plt.plot(X_test, y_pred, color='green', linewidth=2)
        plt.title(f'{model_name} Model\nMSE: {mse:.2f}, MAE: {mae:.2f}')
        plt.xlabel('Average Wind Speed')
        plt.ylabel('Average Temperature')
    
        # Create subplots for residual plots
        plt.subplot(2, 4, subplot_idx + 4)
        plt.scatter(X_test[outliers], residuals[outliers], color='orange', label='Outliers')
        plt.scatter(X_test[~outliers], residuals[~outliers], color='purple')
        plt.axhline(y=0, color='black', linestyle='--', linewidth=2)
        plt.title(f'Residuals\nMean Residual: {np.mean(residuals):.2f}')
        plt.xlabel('Average Wind Speed')
        plt.ylabel('Residuals')
    
    # Create subplots
    plt.figure(figsize=(16, 8))
    
    # Visualize Linear Regression model with outliers
    visualize_model('Linear Regression', lr_model_fit_1, X_test, y_test, 1)
    
    # Visualize Gradient Boosting model with outliers
    visualize_model('Gradient Boosting', gb_model_fit_1, X_test, y_test, 2)
    
    # Visualize Stochastic Gradient Descent model with outliers
    visualize_model('Stochastic Gradient Descent', sgd_model_fit_1, X_test, y_test, 3)
    
    # Visualize Support Vector Regression model with outliers
    visualize_model('Support Vector Regression', svr_model_fit_1, X_test, y_test, 4)
    
    # Adjust layout for better visualization
    plt.tight_layout()
    
    # Create a single legend at the top of the entire plot
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, shadow=True, ncol=4)
    
    plt.show()
    plt.savefig('../output/Linear Regression Analysis Temperature and Wind Modeling Results.png')
    
    <Figure size 2000x1500 with 0 Axes>

    The plots present a comparative analysis of the models, applied to predict average temperature from average wind speed. The upper plots display the actual versus predicted temperatures, and the lower plots show the residuals (differences between actual and predicted values) for each model.

    For all models, the distribution of data points indicates a weak relationship between wind speed and temperature, as shown by the dispersed and broadly scattered plots. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) are relatively high across all models, suggesting that the models are not accurately predicting temperature based on wind speed alone. The presence of numerous outliers, particularly evident in the residual plots where they are marked in orange, further signifies that the models' predictions deviate significantly from the actual values for many observations.

    The residual plots are centered around the zero line, which is expected, but the wide spread of residuals indicates that the prediction errors are substantial. The mean residuals are close to zero for the Linear Regression, Gradient Boosting, and Stochastic Gradient Descent models, showing no bias in over or underestimation. However, the Support Vector Regression model has a negative mean residual, indicating a slight tendency to underpredict temperatures.

    The results align with the preliminary analysis, suggesting that while wind speed does have some relationship with temperature, it is not a strong predictor on its own. The complex nature of temperature patterns, influenced by various climatic factors, cannot be captured through a simple regression with wind speed as the sole predictor. This justifies the need for a multiple regression analysis incorporating additional variables that affect temperature.

    The reference to a subsequent picture showing the results for a linear regression analysis between wind speed and wind gust implies that in cases where two variables are intrinsically linked and exhibit a clear, linear relationship, simple regression can yield low MSE and MAE values, indicative of a successful model fit. However, such an analysis may be of limited scientific interest when the variables are inherently correlated and do not provide unique or insightful information about the system being studied.

    The visualized results proof, that the temperature variable can hardly be determined by the wind speed variable. The MSE and MAE show very bad values and the results contain a lot of outliers. This was to be expected after the EDA in the previous chapters, as temperature and wind speed are influenced by completely different climate phenomena. Also the time series analysis showed a big difference of the visualisation of the temperature and wind variables. E.g., while there was a profound seasonal pattern and a bimodal distribution in the temperature, none of this was true for the wind paramaters. Also the association plots proofed the same. To predict the temperature more accurately, a multiple regression analysis, which uses multiple predictor variables should be used.

    The following picture 2 shows the results for a linear regression analysis for windspeed and windgust, which was experimentively created by a previous iteration (by simply switching the relevant parameters in the python code). Without discussing the results of this analysis or the visualization further, the picture is supposed to show how a good and succesfull linear regression analysis should look like (linear distribution of the dataplots, small MSE and MAE). From a scientific view, these results are barely interesting tho, since the windspeed and windgust parameters are from nature very similar, do correlate and even have a causal connection, since they derrive from the same climate phenommena. Predicting one by the other seems to be redundant, as they are also measured by the same sensors.

    The following analysis is focusing on multiple regressor analysis, to improve the prediction of the temperature feature.

    Linear Regression Analysis Windspeed and Windgust Results Picture 2: Linear Regression Analysis Windspeed and Windgust Results (Created by previous experimentive analysis iteration)

    5.3.1.6 Regression Analysis Temperature and Wind Modeling: Save Model¶

    Saving the trained models with timestamped filenames allows for organized storage and easy retrieval for subsequent analysis or deployment.

    In [31]:
    # Create a meaningful name with a timestamp
    model_name_lr_1 = f"lr_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_gb_1 = f"gb_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_sgd_1 = f"sgd_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_svr_1 = f"svr_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name_lr_1, 'wb') as f:
        pickle.dump(lr_model_fit_1, f)
    with open(folder_path + model_name_gb_1, 'wb') as f:
        pickle.dump(gb_model_fit_1, f)
    with open(folder_path + model_name_sgd_1, 'wb') as f:
        pickle.dump(sgd_model_fit_1, f)
    with open(folder_path + model_name_svr_1, 'wb') as f:
        pickle.dump(svr_model_fit_1, f)
    

    5.3.2 Linear Regression Analysis: Multiple Predictors¶

    In this chapter linear regression with multiple predictor variables is used to predict the temperature variable with an improved accuracy. The analysis uses techniques for multivariate temperature prediction / linear regression of temperature with multiple predictors. It is supposed to answer the research question: How does the incorporation of multiple atmospheric predictors, such as windspeed, windgust, winddirection, air pressure, snow and ice parameters enhance the accuracy of temperature prediction compared to a model solely based on windspeed? This question delves into the complex interplay of various atmospheric variables and their combined effect on temperature changes over different time scales (daytime, day, season, etc.). The outcome of this analysis will contribute to a nuanced model that considers a spectrum of atmospheric conditions for more precise temperature forecasting. This tasks also includes the regression model to uncover interactions and synergies among different atmospheric predictors in influencing temperature variations. This question seeks to identify potential non-linear relationships or combined effects of multiple predictors on temperature. Understanding these interactions is crucial for refining the predictive model and gaining insights into the intricate dynamics of the atmosphere. For this tasks foreward selection and backward deletion methods will be used. Furthermore temporal dynamics are to be analyzed. How does the temporal aspect, including diurnal patterns and seasonal variations, impact the relationship between temperature and multiple predictors? This question aims to capture the temporal dynamics of the atmosphere and assess how the predictive power of the model varies across different timeframes. Examining temporal patterns will contribute to a more nuanced understanding of how atmospheric predictors influence temperature fluctuations under varying conditions.

    5.3.2.1 Feature Selection¶

    5.3.2.1.1 Check for Multicollinearity - Correlation Matrix

    If two independent variables exhibit a high degree of correlation, it gives rise to a phenomenon known as multicollinearity, posing challenges for analysis and interpretation.

    To assess potential multicollinearity, the initial step involves examining correlation coefficients for each pair of continuous (scale) variables. These will be displayed in a correlation matrix. A correlation matrix is a statistical tool that provides a comprehensive view of the relationships between multiple variables in a dataset. It is a square matrix where each entry represents the correlation coefficient between two variables, indicating the strength and direction of their linear relationship.

    The correlation coefficient ranges from -1 to 1:

    • 0.9 to 1 or -0.9 to -1 --> Perfect Correlation (positive or negative)
    • 0.5 to 0.9 or -0.5 to -0.9 --> Strong
    • 0.1 to 0.5 or -0.1 to -0.5 --> Weak
    • 0.0 to 0.1 or 0.0 to -0.1 --> Uncorrelated

    Correlations of 0.8 or higher indicate a robust relationship, implying that only one of the two variables is necessary for regression analysis.

    A positive correlation suggests that as one variable increases, the other tends to increase as well, while a negative correlation indicates an inverse relationship. This information is valuable for understanding the interplay among different factors, aiding in the interpretation of data, and informing decision-making processes. Additionally, the correlation matrix is a crucial tool in the field of feature selection, helping to identify and retain only the most relevant variables in predictive modeling and analysis.

    In [32]:
    #drop columns with datatype object
    df_feature_selection = df.drop(columns=['run_datetime','wep','wind_direction_label','label2','display_label'])
    
    # Calculate the correlation matrix
    correlation_matrix = df_feature_selection.corr()
    
    # Create a heatmap for the correlation matrix
    plt.figure(figsize=(20, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.title('Correlation Matrix of Variables')
    plt.show()
    plt.savefig('../output/Linear Regression Analysis Multiple Predictors Correlation Matrix of Variables.png')
    
    <Figure size 2000x1500 with 0 Axes>

    The heatmap represents a correlation matrix for various weather-related variables. In this matrix, the color intensity reflects the strength of the correlation between any two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation.

    Highly Correlated Temperature Variables: There's a block of variables, including 'avg_temp', 'avg_temp_celsius', 'min_wet_bulb_temp', and 'avg_dewpoint', which are all very highly correlated to each other (correlation coefficients above 0.97). This suggests that they are essentially providing the same information about the temperature.

    Wind Variables: 'avg_windspd' and 'max_windgust' are moderately correlated (0.85), which is expected as gusts are generally related to average wind speeds. However, their correlations with temperature variables are low, indicating that they may not be strong predictors of temperature.

    Directional Wind Variables: 'avg_winddir' and its sine/cosine transformations ('avg_winddir_sin', 'avg_winddir_cos') have very low correlations with other variables, implying they have little or no linear relationship with temperature and other meteorological factors.

    Precipitation Variables: 'max_cumulative_precip', 'max_snow_density_6', and 'max_cumulative_snow' show moderate to high correlations with each other. 'max_snow_density_6', and 'max_cumulative_snow' are negatively correlated with temperature variables, which fits the expectation that snow are associated with lower temperatures.

    Pressure Change: 'avg_pressure_change' shows a moderate negative correlation with 'avg_temp' and is highly negatively correlated with 'label1'.

    Labels: 'label0' and 'label1' are designed to be highly negatively correlated (-0.77). Their correlation with temperature is medium, suggesting that they may be significant for temperature prediction. The negative correlation of 'label0' with temperature suggests it is inversely related to temperature, whereas 'label1' has a positive correlation.

    Given the set threshold for correlation (0.8), the process suggests dropping 'avg_temp_celsius', 'min_wet_bulb_temp', and 'avg_dewpoint' due to their redundancy with 'avg_temp', which is the primary temperature measure. The exclusion of 'max_windgust' alongside 'avg_windspd' may be due to their collinearity, while 'avg_temp_change' is removed for being derivative of the temperature itself.

    The removal of 'avg_pressure_change' is not depicted in this correlation matrix but is mentioned as a data consistency measure, aligning with the need to have complete data across the temporal scope of the study.

    Lastly, 'label0' is considered for exclusion to avoid potential bias and overfitting, given its strong correlation with the target variable. This approach is consistent with practices aimed at improving model generalizability and accuracy.

    In the feature selection process, several variables were dropped from the dataset due to various considerations.

    Setting the correlation threshold to 0.8 ,the variables 'avg_temp_celsius', 'min_wet_bulb_temp', 'avg_dewpoint' and 'max_windgust' should be excluded because they exhibit high collinearity, indicating redundant information in the model. This step is essential to enhance the model's interpretability and avoid multicollinearity issues.

    Additionally, 'avg_temp_change' is removed as it is obviously highly correlated to the temperature.

    Additionally, the variable 'avg_pressure_change' was removed as it lacked data for the year 2022, making it inconsistent with the temporal scope of the analysis. Maintaining data consistency is crucial for the reliability of the model.

    Furthermore, 'label0' was dropped from the dataset because it demonstrated a strong correlation with the target variable. Including such highly correlated variables may introduce bias and hinder the model's ability to generalize to new data. Therefore, the decision to exclude 'label0' aims to improve the model's predictive accuracy and reduce overfitting

    In [33]:
    #set "avg_temp_change" to 0 in the first row, as there is no temp change
    df_feature_selection.at[0, 'avg_temp_change'] = 0
    
    #drop columns, which are not numeric and not needed
    df_feature_selection = df_feature_selection.drop(columns = ['min_wet_bulb_temp', 'avg_temp_change', 'max_windgust' , 'avg_dewpoint' , 'avg_temp_celsius','avg_pressure_change','label0'] )
    
    #show adjusted df
    df_feature_selection.head()
    
    Out[33]:
    avg_temp avg_windspd avg_winddir avg_winddir_sin avg_winddir_cos max_cumulative_precip max_snow_density_6 max_cumulative_snow max_cumulative_ice label1
    0 287.389224 3.386380 80.302464 -0.981653 0.190676 2.009 0.0 0.0 0.0 1
    1 287.378997 3.326687 76.866373 0.994736 0.102466 1.209 0.0 0.0 0.0 1
    2 287.388845 3.243494 76.258867 0.758262 0.651950 0.400 0.0 0.0 0.0 1
    3 287.427324 3.145505 78.299616 0.237897 -0.971290 0.000 0.0 0.0 0.0 1
    4 287.489158 3.047607 84.632852 0.189006 -0.981976 0.000 0.0 0.0 0.0 1

    5.3.2.1.2 Feature Forward Selection:

    Moving into the feature selection phase, the process shifts focus to refining the predictor variables to enhance the model's predictive power. This involves identifying the most significant features through forward selection, a methodical approach to building the regression model by adding variables sequentially based on their statistical significance and contribution to the model's overall explanatory power.

    In [34]:
    # Extracting the target variable and predictor variables
    y = df_feature_selection['label1']
    #drop target variable and strongly correlated varaibles
    X = df_feature_selection.drop(columns=['label1'])
    
    # Initializing an empty list to store the selected variables
    selected_vars = []
    # Initializing an empty list to store adjusted R-squared scores for each step
    adjusted_r2_scores = []
    
    # Performing forward selection
    for i in range(len(X.columns)):
        best_score = -1  # Initialize with a value that will be replaced on the first iteration
        best_var = None
    
        # Iterate through remaining variables
        for var in X.columns:
            if var not in selected_vars:
                # Include the variable in the model
                current_vars = selected_vars + [var]
                X_current = X[current_vars]
                X_current = sm.add_constant(X_current)  # Adding a constant for the intercept
    
                # Fit the model
                model = sm.OLS(y, X_current).fit()
                current_score = model.rsquared_adj
    
                # Update the best variable if the score is higher
                if current_score > best_score:
                    best_score = current_score
                    best_var = var
    
        # Update selected variables and store results
        selected_vars.append(best_var)
        adjusted_r2_scores.append(best_score)
        print(f"Step {i + 1}: Added {best_var}, Adjusted R-squared: {best_score:.4f}")
    
    # Display the results
    results_df = pd.DataFrame({'Variable': selected_vars, 'Adjusted R-squared': adjusted_r2_scores})
    print("\nFinal Forward Selection Results:")
    print(results_df)
    
    Step 1: Added max_snow_density_6, Adjusted R-squared: 0.4062
    Step 2: Added avg_temp, Adjusted R-squared: 0.5109
    Step 3: Added max_cumulative_snow, Adjusted R-squared: 0.5506
    Step 4: Added max_cumulative_ice, Adjusted R-squared: 0.5527
    Step 5: Added avg_windspd, Adjusted R-squared: 0.5541
    Step 6: Added avg_winddir, Adjusted R-squared: 0.5551
    Step 7: Added max_cumulative_precip, Adjusted R-squared: 0.5556
    Step 8: Added avg_winddir_sin, Adjusted R-squared: 0.5556
    Step 9: Added avg_winddir_cos, Adjusted R-squared: 0.5556
    
    Final Forward Selection Results:
                    Variable  Adjusted R-squared
    0     max_snow_density_6            0.406223
    1               avg_temp            0.510857
    2    max_cumulative_snow            0.550552
    3     max_cumulative_ice            0.552674
    4            avg_windspd            0.554109
    5            avg_winddir            0.555149
    6  max_cumulative_precip            0.555629
    7        avg_winddir_sin            0.555632
    8        avg_winddir_cos            0.555626
    

    5.3.2.1.3 Backward Feature Elimination:

    Next, the focus will shift to optimizing the feature set using a backward feature elimination approach. By training a RandomForestClassifier with all the identified relevant features, an initial accuracy benchmark will be established. Then, through an iterative process, features will be selectively removed one by one, reassessing the model's accuracy with each reduction. The goal is to determine the smallest set of features that maintain or improve the initial accuracy, thereby streamlining the model for efficiency while preserving its predictive strength. This method aims to reduce complexity, potentially improve performance, and offer insights into the most influential predictors.

    In [35]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train a model with all features
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    initial_accuracy = accuracy_score(y_test, y_pred)
    
    # Perform Backward Feature Elimination
    all_features = list(X.columns)
    selected_features = all_features.copy()
    
    while len(selected_features) > 1:  # Stopping criterion: keep at least one feature
        # Train model with the current set of features
        model.fit(X_train[selected_features], y_train)
        y_pred = model.predict(X_test[selected_features])
        
        # Evaluate model performance
        current_accuracy = accuracy_score(y_test, y_pred)
        
        # If removing a feature improves performance, remove the least important one
        if current_accuracy >= initial_accuracy:
            break
        else:
            least_important_feature = min(selected_features, key=lambda f: model.feature_importances_[selected_features.index(f)])
            selected_features.remove(least_important_feature)
    
    print("Optimal Feature-Subset:")
    for feature in selected_features:
        print(feature)
    print("Accuracy: ",current_accuracy)
    
    Optimal Feature-Subset:
    avg_temp
    avg_windspd
    avg_winddir
    avg_winddir_sin
    avg_winddir_cos
    max_cumulative_precip
    max_snow_density_6
    max_cumulative_snow
    max_cumulative_ice
    Accuracy:  0.9996939322059836
    

    The utilization of forward selection and backward elimination techniques in feature selection did not yield significant advantages in this study. Prior to the commencement of the Bachelor's thesis, a thorough evaluation of variables was conducted based on existing literature and relevant metrics. This meticulous pre-selection process ensured that only essential variables were included in the dataset, aligning with established theoretical foundations and methodological considerations.

    Given the thoughtful consideration applied during the variable selection phase, where variables were chosen based on their significance and relevance, the additional application of forward selection and backward elimination did not contribute substantially to the refinement of the feature set. The initial careful curation of variables, guided by a comprehensive understanding of the research domain, metrics, and existing knowledge, rendered these automated techniques less impactful in enhancing the model's performance.

    Therefore, the decision to forego extensive reliance on forward selection and backward elimination was grounded in the belief that the prior knowledge-driven variable selection approach had already captured the essential components for the study, ensuring a focused and meaningful analysis.

    5.3.2.1.4 Write Results to Df

    Write these relevant variables and temperature to new dataframe

    In [36]:
    # Extract relevant columns from the DataFrame
    results_df = df[["run_datetime","avg_temp", "avg_winddir", "avg_windspd","avg_winddir_sin","avg_winddir_cos","max_cumulative_precip","max_snow_density_6","max_cumulative_snow","max_cumulative_ice"]]
    
    # Set the 'run_datetime' column as the DataFrame index and convert it to datetime format
    results_df['run_datetime'] = pd.to_datetime(results_df['run_datetime'])
    results_df.set_index('run_datetime', inplace=True)
    
    # Remove rows with duplicated indices, keeping only the first occurrence
    results_df = results_df.loc[~results_df.index.duplicated(keep='first')]
    
    # Set "avg_temp_change" to 0 in the first row, as there is no temp change
    results_df.iloc[0, results_df.columns.get_loc("avg_temp")] = 0
    
    # Display the first few rows of the resulting DataFrame
    results_df.head()
    
    Out[36]:
    avg_temp avg_winddir avg_windspd avg_winddir_sin avg_winddir_cos max_cumulative_precip max_snow_density_6 max_cumulative_snow max_cumulative_ice
    run_datetime
    2015-07-15 00:00:00 0.000000 80.302464 3.386380 -0.981653 0.190676 2.009 0.0 0.0 0.0
    2015-07-15 01:00:00 287.378997 76.866373 3.326687 0.994736 0.102466 1.209 0.0 0.0 0.0
    2015-07-15 02:00:00 287.388845 76.258867 3.243494 0.758262 0.651950 0.400 0.0 0.0 0.0
    2015-07-15 03:00:00 287.427324 78.299616 3.145505 0.237897 -0.971290 0.000 0.0 0.0 0.0
    2015-07-15 04:00:00 287.489158 84.632852 3.047607 0.189006 -0.981976 0.000 0.0 0.0 0.0
    5.3.2.2 Multiple Linear Regression¶

    5.3.2.2.1 Temperature and Wind Modeling over Time - Select Model (Multiple Linear Regression)

    In the initial phase of the Temperature and Wind Modeling over Time analysis, a Multiple Linear Regression is conducted, as introduced in the lecture. Multiple regression is akin to linear regression but involves more than one independent variable, meaning an attempt is made to predict a value based on two or more variables.

    5.3.2.2.2 Temperature and Wind Modeling over Time - Training and Validation (Multiple Linear Regression)

    In the following analysis, three linear regression models are trained to predict different meteorological parameters.

    In the first model, the target variable is avg_temp, and the predictor variables include avg_windspd and avg_winddir.

    For the second model, avg_temp remains the target variable, incorporating all previously evaluated variables.

    Lastly, the third model focuses on predicting avg_winddir with the inclusion of all previously assessed variables.

    These models aim to capture relationships between temperature, wind speed, wind direction, and other relevant features, providing insights into the complex dynamics of the observed meteorological data.

    The models will be evaluated with the following plots:

    1. Scatter Plot of True vs. Predicted Values:

      • Interpretation: This plot shows the relationship between the actual values (true values) and the predicted values. Ideally, the points should lie along a diagonal line, indicating a perfect match between the true and predicted values. Deviations from the diagonal line suggest differences between the predicted and actual values.
    2. Residual Plot:

      • Interpretation: The residual plot displays the distribution of residuals (the differences between actual and predicted values) against the predicted values. The horizontal line at y=0 represents zero residuals. Patterns or trends in the residuals could indicate systematic errors in the model. Ideally, residuals should be randomly scattered around the zero line.
    3. Histogram of Residuals:

      • Interpretation: The histogram provides a visual representation of the distribution of residuals. A symmetric and bell-shaped distribution suggests that the residuals are approximately normally distributed. Skewed or non-uniform distributions may indicate issues with the model's assumptions.

    Multiple Linear Regression (MLR) with target variable "Average Temperature" and predictor variables "Average Windspeed" and "Wind Direction"

    In [37]:
    # Extracting the target variable and predictor variables
    y = results_df['avg_temp']
    X = results_df[['avg_windspd','avg_winddir']]
    
    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    
    #Training (Multiple Linear Regression)
    regr = LinearRegression()
    regr.fit(X, y)
    
    # Fit Model (Multiple Linear Regression)
    y_pred = regr.predict(X_test)
    
    #Evaluate on Test Set (Multiple Linear Regression)
    mse = mean_squared_error(y_test, y_pred)
    print('Mean Squared Error:', mse)
    
    mae = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error:', mae)
    
    r2 = r2_score(y_test, y_pred)
    print('R-squared:', r2)
    
    Mean Squared Error: 128.267893719392
    Mean Absolute Error: 9.658907432391409
    R-squared: 0.015238184007991595
    

    The very low R-squared value that the linear model does not perform well on the given dataset.

    Visual evaluations of the regression model will be conducted through a series of plots to assess prediction accuracy and identify any patterns in the residuals that may suggest model adjustments.

    In [38]:
    # Create subplots
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Scatter Plot
    axes[0].scatter(y_test, y_pred)
    axes[0].set_xlabel('True Values')
    axes[0].set_ylabel('Predictions')
    axes[0].set_title('Scatter Plot of True vs. Predicted Values')
    
    # Residual Plot
    residuals = y_test - y_pred
    axes[1].scatter(y_pred, residuals)
    axes[1].axhline(y=0, color='r', linestyle='--')
    axes[1].set_xlabel('Predicted Values')
    axes[1].set_ylabel('Residuals')
    axes[1].set_title('Residual Plot')
    
    # Histogram of Residuals
    sns.histplot(residuals, kde=True, ax=axes[2])
    axes[2].set_xlabel('Residuals')
    axes[2].set_ylabel('Frequency')
    axes[2].set_title('Histogram of Residuals')
    
    # Add title
    fig.suptitle('MLR with target variable "Average Temperature" and predictor variables "Average Windspeed" and "Wind Direction"')
    
    # Adjust layout
    plt.tight_layout()
    
    # Show plots
    plt.show()
    

    The visualizations collectively indicate how well the Multiple Linear Regression (MLR) model performs in predicting average temperature from wind speed and direction.

    Scatter Plot Analysis: The scatter plot reveals a dense cluster of points without a clear linear pattern, indicating that the model does not predict the average temperature accurately across the range of true values. The cloud-like distribution of points, rather than a tight diagonal line, suggests that the model's predictions are not consistently aligned with the true values, leading to a broad spread of prediction errors.

    Residual Plot Analysis: The residual plot shows a wide dispersion of residuals across the predicted temperature range, with no discernible pattern. The lack of structure implies that the model errors are somewhat random, but the spread of residuals indicates variability in the model's performance across different temperature ranges. The concentration of points around the zero line for lower predicted values and the spread for higher values suggest the model may be less accurate at extreme temperatures.

    Histogram of Residuals Analysis: The histogram of residuals displays a distribution with two peaks, deviating from the expected normal distribution, which would typically indicate a well-fitting model. The skewness to the left implies that the model more frequently underestimates the temperature (more negative residuals). The presence of two peaks could indicate the model performs differently for two subsets of the data, possibly corresponding to different conditions or time periods within the dataset.

    Conclusion: The visual assessments suggest that the model's predictive power is limited. The residuals' behavior indicates potential model inadequacies, possibly due to omitted variables, non-linear relationships, or other model specification issues. The results highlight the need for further model refinement, such as incorporating additional predictive features, exploring non-linear models, or segmenting the data for different conditions. The ultimate goal is to achieve a more homogenous distribution of residuals and a narrower scatter plot distribution around the line of perfect prediction.

    Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)

    In [39]:
    # Create a meaningful name with a timestamp
    model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(regr, f)
    
    print(f"The MLR model has been saved to '{folder_path + model_name}'.")
    
    The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
    

    Multiple Linear Regression (MLR) with target variable "Average Temperature" and multiple predictor variables

    A Multiple Linear Regression model with average temperature as the target variable is constructed using a wider range of predictors. This model seeks to improve upon previous ones by utilizing a more extensive set of features that may influence temperature. The expected outcome is a more accurate prediction, as indicated by improvements in the Mean Squared Error, Mean Absolute Error, and R-squared metrics.

    In [40]:
    # Extracting the target variable and predictor variables
    y = results_df['avg_temp']
    X = results_df.drop(columns=['avg_temp'])
    
    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    
    #Training (Multiple Linear Regression)
    regr = LinearRegression()
    regr.fit(X, y)
    
    # Fit Model (Multiple Linear Regression)
    y_pred = regr.predict(X_test)
    
    #Evaluate on Test Set (Multiple Linear Regression)
    mse = mean_squared_error(y_test, y_pred)
    print('Mean Squared Error:', mse)
    
    mae = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error:', mae)
    
    r2 = r2_score(y_test, y_pred)
    print('R-squared:', r2)
    
    Mean Squared Error: 103.25403414868913
    Mean Absolute Error: 8.390307032430659
    R-squared: 0.20727917775583216
    

    The R-squared value has increased compared to before, but it remains relatively low.

    Additional visualizations will be created to analyze the regression model's performance, including a scatter plot, a residual plot, and a histogram of residuals. These will help identify any remaining discrepancies between predicted and actual values.

    In [41]:
    # Create subplots
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Scatter Plot
    axes[0].scatter(y_test, y_pred)
    axes[0].set_xlabel('True Values')
    axes[0].set_ylabel('Predictions')
    axes[0].set_title('Scatter Plot of True vs. Predicted Values')
    
    # Residual Plot
    residuals = y_test - y_pred
    axes[1].scatter(y_pred, residuals)
    axes[1].axhline(y=0, color='r', linestyle='--')
    axes[1].set_xlabel('Predicted Values')
    axes[1].set_ylabel('Residuals')
    axes[1].set_title('Residual Plot')
    
    # Histogram of Residuals
    sns.histplot(residuals, kde=True, ax=axes[2])
    axes[2].set_xlabel('Residuals')
    axes[2].set_ylabel('Frequency')
    axes[2].set_title('Histogram of Residuals')
    
    # Add title
    fig.suptitle('MLR with target variable "Average Temperature" and multiple predictor variables')
    
    # Adjust layout
    plt.tight_layout()
    
    # Show plots
    plt.show()
    

    Scatter Plot Analysis: The plot shows that predictions are not well-aligned with the true values; particularly, for true values in the mid-range, predictions are compressed towards the center of the range. This suggests a model bias towards average temperatures and difficulty in accurately predicting more extreme temperatures.

    Residual Plot Analysis: Residuals are not randomly dispersed around the zero line, which would indicate accurate predictions. Instead, there's a pronounced pattern where predictions for mid-range temperatures are too high (positive residuals) and for the extremes are too low (negative residuals), particularly at the lower end of the temperature scale.

    Histogram of Residuals Analysis: The histogram reveals a bimodal distribution, indicating two different patterns of residuals within the data. Instead of the expected normal distribution, this model's residuals suggest that there may be two distinct subsets of data, or that the model's linear assumptions are not fully capturing the underlying relationships.

    Conclusion: These plots confirm that while the model with multiple predictors performs better than the simpler models, it still struggles to accurately predict temperatures at the extremes and tends to predict closer to the mean value, leading to systematic errors. This suggests the need for further model refinement or the exploration of non-linear models to better capture the complexity of the relationships between the variables.

    Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)

    In [42]:
    # Create a meaningful name with a timestamp
    model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(regr, f)
    
    print(f"The MLR model has been saved to '{folder_path + model_name}'.")
    
    The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
    

    Multiple Linear Regression (MLR) with target variable "Average Winddirection" and multiple predictor variables

    In [43]:
    # Extracting the target variable and predictor variables
    y = results_df['avg_windspd']
    X = results_df.drop(columns=['avg_windspd'])
    
    # Splitting the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    
    #Training (Multiple Linear Regression)
    regr = LinearRegression()
    regr.fit(X, y)
    
    #  Fit Model (Multiple Linear Regression)
    y_pred = regr.predict(X_test)
    
    #Evaluate on Test Set (Multiple Linear Regression)
    mse = mean_squared_error(y_test, y_pred)
    print('Mean Squared Error:', mse)
    
    mae = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error:', mae)
    
    r2 = r2_score(y_test, y_pred)
    print('R-squared:', r2)
    
    Mean Squared Error: 0.5038677092424138
    Mean Absolute Error: 0.5636674787101482
    R-squared: 0.1563048200931636
    

    The R-squared value is extremely low, indicating poor model performance.

    In [44]:
    # Create subplots
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Scatter Plot
    axes[0].scatter(y_test, y_pred)
    axes[0].set_xlabel('True Values')
    axes[0].set_ylabel('Predictions')
    axes[0].set_title('Scatter Plot of True vs. Predicted Values')
    
    # Residual Plot
    residuals = y_test - y_pred
    axes[1].scatter(y_pred, residuals)
    axes[1].axhline(y=0, color='r', linestyle='--')
    axes[1].set_xlabel('Predicted Values')
    axes[1].set_ylabel('Residuals')
    axes[1].set_title('Residual Plot')
    
    # Histogram of Residuals
    sns.histplot(residuals, kde=True, ax=axes[2])
    axes[2].set_xlabel('Residuals')
    axes[2].set_ylabel('Frequency')
    axes[2].set_title('Histogram of Residuals')
    
    # Add title
    fig.suptitle('Multiple Linear Regression (MLR) with target variable "Average Winddirection" and multiple predictor variables')
    
    # Adjust layout
    plt.tight_layout()
    
    # Show plots
    plt.show()
    

    Scatter Plot Analysis: The scatter plot shows a lack of a definitive pattern or linear relationship, with data points widely spread out. Ideally, we would expect to see points along a diagonal line if the predictions closely matched the true values. The broad spread indicates the model's predictions vary significantly from the actual wind direction values, particularly for lower and higher true values.

    Residual Plot Analysis: The residual plot does not show the random scatter expected of a well-performing model. Instead, the spread of residuals indicates systematic errors in predictions across all ranges. The absence of a clear trend or clustering around the zero line suggests the model is not accurately capturing the relationship between wind direction and the predictors.

    Histogram of Residuals Analysis: The histogram reveals a right-skewed distribution, with a tail extending towards the positive side, suggesting that the model has a tendency to underpredict the wind direction. The peak of the histogram near zero indicates that for a significant number of predictions, the model's errors are small, but the long tail highlights that there are also many instances where the model's predictions are quite far off.

    Conclusion: These visualizations indicate that the model's ability to predict 'Average Winddirection' is limited. The skewness in the residual histogram and the widespread in the scatter plot suggest that the model's current predictors do not capture all the factors influencing wind direction, or that the relationship between wind direction and the predictors might be non-linear or influenced by other variables not included in the model. Further model refinement or the use of different modeling techniques, such as non-linear models or inclusion of interaction terms, might improve predictive performance.

    Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)

    In [45]:
    # Create a meaningful name with a timestamp
    model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(regr, f)
    
    print(f"The MLR model has been saved to '{folder_path + model_name}'.")
    
    The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
    

    Conclusion:
    The charts and metrics indicate that all models perform poorly. In general, it can be concluded that neither temperature nor wind speed can be well predicted based on the given data. The use of all evaluated features (from the feature selection process) represents an improvement compared to exclusively using the variables "Windspeed" and "Winddirection.
    This could be due to the fact that weather data often exhibits non-linear relationships that cannot be captured by a linear regression. Additionally, the presence of outliers and anomalies may distort the results. Given that weather events and temperature fluctuations frequently have highly complex underlying factors, these interactions might be too intricate to be adequately addressed by a Multiple Linear Regression. There is also the possibility of missing relevant variables that could enhance the prediction. It's essential to consider more sophisticated modeling approaches, such as non-linear regression or machine learning algorithms, to better capture the complexities inherent in weather data. Additionally, thorough data analysis, outlier handling, and consideration of domain-specific knowledge are crucial for improving the performance of the predictive models.
    Consequently, the multiple linear regression model is challenging to apply to the given data without further adjustments.

    5.3.3 Temperature Forecast¶

    After predicting the temperature parameter using a multiple predictor linear regression worked very well and produced good results, it should now be determined how, using statistical, big data and machine learning techniques, the temperature paramater can be forecasted. For this analysis training data from up until 2021 is used and data from the year 2022 will be predicted, thus simulating a temperature forecast. Different statistical time series and machine learning techniques are used for this approach. The result is not supposed to be a very exact prediction of the temperature for the year 2022 or even some single weeks or days in it, but rather a basic and general development and progression curve of the weather parameter for the given forecasted or predicted test data.

    5.3.3.1 SARIMAX MODEL¶

    5.3.3.1.1 Temperature and Wind Modeling over Time - Select Model (SARIMAX)

    The ARIMA (Auto-Regressive Integrated Moving Average) approach is one of the most cost-effective and reliable time series techniques. As these exogenous variable Wind should be included and seasonality is considered, the extended SARIMAX Model is utilzed here 1. SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors) models are among the most widely used statistical models for forecasting, with excellent forecasting performance 2.

    • Seasonal (S): SARIMAX accounts for seasonality in the time series data, allowing the model to capture recurring patterns or cycles.
    • AutoRegressive (AR) : SARIMAX includes the autoregressive component, which models the relationship between the current value and its past values.
    • Integrated (I): This component represents differencing of raw observations, which is used to make the time series data stationary by removing trends. Moving Average (MA): SARIMAX includes the moving average component, which models the relationship between the current value and past forecast errors.
    • eXogenous (X): The "X" in SARIMAX denotes exogenous variables, which are external factors that are not part of the time series but may influence it. These external variables are incorporated into the model to improve forecasting accuracy 3.

    By looking at the visualization of the temperature from above (chapter 5.3.2.2.3) some seasonality in data can be noticed. Besides the seasonality, it is also relevant to analyse the underlying trend. The examination of trends enables the identification of long-term patterns, aiding in the understanding of climate change and gradual shifts in meteorological conditions. Seasonality, on the other hand, unveils recurring patterns that follow a consistent cycle, offering insights into predictable variations such as temperature fluctuations and precipitation changes throughout the year. Meanwhile, scrutinizing residuals reveals irregularities and unexplained components in the data, allowing for the detection of anomalies or sudden deviations from expected weather patterns. This holistic exploration of trend, seasonality, and residuals not only enhances the ability to make accurate weather predictions but also empowers meteorologists and climate scientists to comprehend the broader dynamics of the atmosphere, contributing to a more comprehensive and nuanced understanding of weather phenomena.

    In the next step, a monthly time series decomposition in Trend, Seasonality and Trend will be performed:

    • Trend: The examination of trends enables the identification of long-term patterns, aiding in the understanding of climate change and gradual shifts in meteorological conditions.
    • Seasonality: The Seasonality, unveils recurring patterns that follow a consistent cycle, offering insights into predictable variations such as temperature fluctuations and precipitation changes throughout the year.
    • Residuals: Residuals are the remainder of aforementioned components. It reveals irregularities and unexplained components in the data, allowing for the detection of anomalies or sudden deviations from expected weather patterns.

    This holistic exploration of trend, seasonality, and residuals not only enhances understanding in the data and therefore ability to make accurate weather predictions but also empowers meteorologists and climate scientists to comprehend the broader dynamics of the atmosphere, contributing to a more comprehensive and nuanced understanding of weather phenomena.


    1. Elshewey, Ahmed M./Mahmoud Y. Shams/Abdelghafar M. Elhady/Samaa M. Shohieb/Abdelaziz A. Abdelhamid/Abdelhameed Ibrahim/Zahraa Tarek (2022): A novel WD-SARIMAX model for temperature forecasting using Daily Delhi Climate Dataset, in: Sustainability, vol. 15, no. 1, p. 757, [online] doi:10.3390/su15010757.↩

    2. Ortiz, Joaquin Amat Rodrigo and Javier Escobar (n.d.): Forecasting SARIMAX and ARIMA models - Skforecast Docs, [online] https://joaquinamatrodrigo.github.io/skforecast/0.7.0/user_guides/forecasting-sarimax-arima.html#.↩

    3. Nguyen, Bruce (2021): End-to-End Time Series Analysis and Forecasting: a Trio of SARIMAX, LSTM and Prophet (Part 1), in: Medium, 29.12.2021, [online] https://towardsdatascience.com/end-to-end-time-series-analysis-and-forecasting-a-trio-of-sarimax-lstm-and-prophet-part-1-306367e57db8.↩

    Create df for sarimax prediction:

    In [46]:
    # Extract relevant columns from the DataFrame
    df_temp_wind = df_full[["run_datetime", "avg_temp", "avg_windspd", "avg_winddir", "avg_windgust"]]
    
    # Set the 'run_datetime' column as the DataFrame index and convert it to datetime format
    df_temp_wind.index = pd.to_datetime(df_temp_wind['run_datetime'])
    
    # Drop the original 'run_datetime' column as it's now set as the index
    df_temp_wind.drop(columns='run_datetime', inplace=True)
    
    # Remove rows with duplicated indices, keeping only the first occurrence
    df_temp_wind = df_temp_wind.loc[~df_temp_wind.index.duplicated(keep='first')]
    
    # Display the first few rows of the resulting DataFrame
    df_temp_wind.head()
    
    Out[46]:
    avg_temp avg_windspd avg_winddir avg_windgust
    run_datetime
    2015-07-15 00:00:00 287.389224 3.386380 80.302464 11.136197
    2015-07-15 01:00:00 287.378997 3.326687 76.866373 11.002795
    2015-07-15 02:00:00 287.388845 3.243494 76.258867 10.700595
    2015-07-15 03:00:00 287.427324 3.145505 78.299616 10.323983
    2015-07-15 04:00:00 287.489158 3.047607 84.632852 9.921157

    The time series data will be decomposed into trend, seasonality, and residuals, allowing for an in-depth analysis of underlying patterns. This step is crucial for enhancing the predictive accuracy of the SARIMAX model in temperature and wind forecasting. By examining these components individually, a more precise understanding of weather dynamics can be achieved, leading to improved model performance and insight into climatic behaviors.

    In [47]:
    from statsmodels.tsa.seasonal import seasonal_decompose
    
    def decompose_and_plot(data, variable_name, position):
        # Decompose the time series data
        decompose_data = seasonal_decompose(data.asfreq('M'), model="additive", extrapolate_trend='freq')
        
        # Plot the original data
        plt.subplot(4, 4, position)
        data.plot()
        plt.title(f'Original Data of {variable_name}')
        
        # Plot the trend component
        plt.subplot(4, 4, position + 1)
        decompose_data.trend.plot()
        plt.title(f'Trend of {variable_name}')
        
        # Plot the seasonal component
        plt.subplot(4, 4, position + 2)
        decompose_data.seasonal.plot()
        plt.title(f'Seasonal of {variable_name}')
        
        # Plot the residual component
        plt.subplot(4, 4, position + 3)
        decompose_data.resid.plot()
        plt.title(f'Residual of {variable_name}')
    
    
    # Create a figure with a size of 16x12 inches
    plt.figure(figsize=(16, 12))
    
    # Decompose and plot for avg_temp
    decompose_and_plot(df_temp_wind["avg_temp"], "avg_temp", 1)
    
    # Decompose and plot for avg_winddir
    decompose_and_plot(df_temp_wind["avg_winddir"], "avg_winddir", 5)
    
    # Decompose and plot for avg_windspd
    decompose_and_plot(df_temp_wind["avg_windspd"], "avg_windspd", 9)
    
    # Decompose and plot for avg_windgust
    decompose_and_plot(df_temp_wind["avg_windgust"], "avg_windgust", 13)
    
    # Adjust layout to prevent overlap
    plt.tight_layout()
    
    # Display the plots
    plt.show()
    plt.savefig('../output/Linear Regression Analysis Multiple Predictors Seasonality and Trend.png')
    
    <Figure size 2000x1500 with 0 Axes>

    The provided time series plots offer a comprehensive view of the data's characteristics by breaking down each variable into its original, trend, seasonal, and residual components.

    Average Temperature (avg_temp):

    Original Data: Displays a clear, repetitive oscillation, indicative of strong seasonality in temperature data. Trend: There seems to be slight variability, with some periods showing a gentle rise or fall, but due to the nature of temperature data, which is highly seasonal, this doesn't necessarily indicate a long-term trend. Seasonality: Shows consistent and expected cyclical patterns corresponding to the seasons, with peaks typically in the warmer months and troughs in the cooler months. Residuals: Fluctuations around the zero line without any discernible pattern suggest that most of the systematic information has been captured by the trend and seasonal components.

    Average Wind Direction (avg_winddir):

    Original Data: Appears relatively constant over time, although there is huge volatility. Trend: No discernible long-term trend is visible; the data fluctuates around a mean without a clear direction. Seasonality: The seasonal plot does not show a clear pattern, which is typical for wind direction as it can be influenced by various unpredictable factors. Residuals: The residuals are large in magnitude compared to the other variables, reflecting the high variability in wind direction that is not captured by the seasonal or trend components.

    Average Wind Speed (avg_windspd):

    Original Data: Exhibits huge variance over time with what appears to be seasonal peaks and troughs. Trend: A very subtle decrease in wind speed is observed, but given the short time frame and inherent variability of wind speed data, it's hard to confirm this as a significant trend. Seasonality: Clear seasonal patterns are evident, likely reflecting the influence of larger atmospheric conditions that affect wind speed. Residuals: The residuals remain after removing the trend and seasonal effects, showing random variability which is typical for wind speed data.

    Average Wind Gust (avg_windgust):

    Original Data: Similar to average wind speed, it shows variability that suggests a seasonal influence. Trend: There is a suggestion of a decreasing trend, but as with wind speed, the inherent variability and short time frame make it difficult to ascertain a definitive trend. Seasonality: Displays a seasonal cycle with peaks and troughs, indicating that wind gusts are stronger in some seasons than others. Residuals: The residuals indicate the random fluctuations that are left after accounting for the systematic components of the data.

    Conclusion: The decomposition of each variable into trend, seasonality, and residuals has revealed that while there is a strong seasonal component across all variables, the trends are not as clear due to the short timespan and natural variability in the data. The residual plots do not exhibit any additional patterns, suggesting that the decomposed components capture the primary behaviors of the dataset. This analysis confirms the cyclical nature of weather-related data and underlines the importance of seasonal adjustment in forecasting models like SARIMAX. The lack of pattern in the residuals is a positive indicator that the model has adequately captured the data's structure.

    Augmented Dickey-Fuller Test:

    Hereinafter the Augmented Dickey-Fuller Test will be performed to determine wether the given time series is stationary or not. The ADF Test tests the null hypothesis that α=1. When the test statistic is lower than the critical value shown, the null hypothesis is rejcted and the time series is as stationary inferred 1. A stationary time series is one that has its mean, variance and autocorrelation structure unchanging overtime. In other words, it does not have any cycle/trend or seasonality.


    1. Prabhakaran, Selva (2022): Augmented Dickey Fuller Test (ADF Test) – must read guide, Machine Learning Plus, [online] https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/.↩

    In [48]:
    from statsmodels.tsa.stattools import adfuller
    
    def adf_test(variable, name):
        # Perform the Augmented Dickey-Fuller test
        dftest = adfuller(variable, autolag='AIC')
        
        # Print the results
        print(f"{name}: ")
        print(f"1. ADF : {dftest[0]}")
        print(f"2. P-Value : {dftest[1]}")
        print(f"3. Num Of Lags : {dftest[2]}")
        print(f"4. Num Of Observations Used For ADF Regression and Critical Values Calculation : {dftest[3]}")
        print("5. Critical Values :")
        for key, val in dftest[4].items():
            print(f"\t{key} : {val}")
        print("\n")
    
    # Perform ADF test for each variable
    adf_test(df_temp_wind["avg_temp"], "Average Temperature")
    adf_test(df_temp_wind["avg_winddir"], "Average Wind Direction")
    adf_test(df_temp_wind["avg_windspd"], "Average Wind Speed")
    adf_test(df_temp_wind["avg_windgust"], "Average Wind Gust")
    
    Average Temperature: 
    1. ADF : -5.6633939102963415
    2. P-Value : 9.268984051353608e-07
    3. Num Of Lags : 61
    4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65258
    5. Critical Values :
    	1% : -3.4304502108131385
    	5% : -2.8615842913495717
    	10% : -2.566793574780786
    
    
    Average Wind Direction: 
    1. ADF : -26.935153074526742
    2. P-Value : 0.0
    3. Num Of Lags : 54
    4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65265
    5. Critical Values :
    	1% : -3.4304502000646013
    	5% : -2.861584286598995
    	10% : -2.5667935722522013
    
    
    Average Wind Speed: 
    1. ADF : -26.388919718643297
    2. P-Value : 0.0
    3. Num Of Lags : 53
    4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65266
    5. Critical Values :
    	1% : -3.4304501985292846
    	5% : -2.861584285920424
    	10% : -2.566793571891019
    
    
    Average Wind Gust: 
    1. ADF : -24.04907190849269
    2. P-Value : 0.0
    3. Num Of Lags : 61
    4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65258
    5. Critical Values :
    	1% : -3.4304502108131385
    	5% : -2.8615842913495717
    	10% : -2.566793574780786
    
    
    

    It can be seen, that the ADF test statistic for average temperature is -5.6634 with a p-value of 9.27e-07, indicating a rejection of the null hypothesis of a unit root. The negative ADF value, lower than the critical values at the 1%, 5%, and 10% significance levels, supports the conclusion of stationarity in the average temperature time series.
    For the other variables with a p-value of 0, which is much less than the significance level of 0.05, the null hypothesis can be rejected, and the series can be considered stationary. Since the ADF tests have indicated stationarity for the variables, it can be proceeded to modelling.

    5.3.3.1.2 Temperature and Wind Modeling over Time - Training and Validation

    In the following the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are performed for the SARIMAX model selection and evaluation.
    These criteria provide quantitative measures of the trade-off between model goodness-of-fit and complexity, aiding in the identification of the most suitable model for a given time series. A lower AIC or BIC indicates a better-fitting model, considering both the accuracy in capturing the observed data and the penalization for model complexity. By systematically exploring various combinations of model parameters and assessing their AIC and BIC values, practitioners can effectively navigate the model space, improving the likelihood of selecting a SARIMAX model that strikes an optimal balance between explanatory power and simplicity, leading to more reliable and generalizable predictions 1.


    1. Zach (2021): How to calculate AIC of regression models in Python, Statology, [online] https://www.statology.org/aic-in-python/.↩

    Akaike Information Criterion (AIC):

    In [49]:
    import itertools
    
    # Define response variable
    y = df_temp_wind['avg_temp']
    
    # Define predictor variables
    predictor_vars = ['avg_windspd', 'avg_winddir', 'avg_windgust']
    
    # Initialize variables to store best AIC and best predictor combination
    best_aic = float('inf')
    best_predictors = None
    
    # Iterate through all combinations of predictor variables
    for r in range(1, len(predictor_vars) + 1):
        for combo in itertools.combinations(predictor_vars, r):
            # Select the current combination of predictors
            current_predictors = list(combo)
            
            # Define the predictors and add a constant
            x = df_temp_wind[current_predictors]
            x = sm.add_constant(x)
    
            # Fit the regression model
            model = sm.OLS(y, x).fit()
    
            # Update best predictors if the current AIC is lower
            if model.aic < best_aic:
                best_aic = model.aic
                best_predictors = current_predictors
    
    # Print the best combination of predictors and its AIC
    print(f"Best Predictors: {best_predictors}")
    print(f"Best AIC: {best_aic}")
    
    Best Predictors: ['avg_windspd', 'avg_winddir']
    Best AIC: 502202.826219331
    

    Bayesian Information Criterion (BIC):

    In [50]:
    # Define response variable
    y = df_temp_wind['avg_temp']
    
    # Define predictor variables
    predictor_vars = ['avg_windspd', 'avg_winddir','avg_windgust']
    
    # Initialize variables to store best BIC and best predictor combination
    best_bic = float('inf')
    best_predictors_bic = None
    
    # Iterate through all combinations of predictor variables
    for r in range(1, len(predictor_vars) + 1):
        for combo in itertools.combinations(predictor_vars, r):
            # Select the current combination of predictors
            current_predictors = list(combo)
            
            # Define the predictors and add a constant
            x = df_temp_wind[current_predictors]
            x = sm.add_constant(x)
    
            # Fit the regression model
            model = sm.OLS(y, x).fit()
    
            # Update best predictors if the current BIC is lower
            if model.bic < best_bic:
                best_bic = model.bic
                best_predictors_bic = current_predictors
    
    # Print the best combination of predictors and its BIC
    print(f"Best Predictors (BIC): {best_predictors_bic}")
    print(f"Best BIC: {best_bic}")
    
    Best Predictors (BIC): ['avg_windspd', 'avg_winddir']
    Best BIC: 502230.0873799723
    

    AIC and BIC both shows that the best predictor variables are 'avg_windspd' and 'avg_winddir'.

    Auto ARIMA
    AutoARIMA is an automated time series forecasting algorithm based on the ARIMA (AutoRegressive Integrated Moving Average) model. The algorithm automates the process of selecting the optimal model parameters (p, d, q) by searching through different combinations and choosing the model with the best fit based on a specified criterion, such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). It integrates both non-seasonal and seasonal components to capture patterns and trends in time series data effectively.

    For the given dataset, which consists of hourly temperature readings, the choice of AutoARIMA parameters is guided by the characteristics of the time series. The m=24 parameter is selected to account for the potential daily seasonality, assuming a repeating pattern every 24 hours. Starting with start_p=1 and start_q=1 provides a sensible initial exploration of autoregressive and moving average components. The max_p=3 and max_q=3 values limit the search space for these components, preventing excessive model complexity. The inclusion of D=1 indicates that seasonal differencing is considered, and trace=True allows for the monitoring of the algorithm's progress. These choices aim to strike a balance between capturing the underlying patterns in the data while preventing overfitting.

    In [51]:
    from pmdarima import auto_arima
    
    auto_arima_results= auto_arima(df_temp_wind["avg_temp"], 
               start_p=1,
               start_q=1,
               max_p=3,
               max_q=3, # maximum p and q
               m=24,              # frequency of series
               d=None,           # let model determine 'd'
               seasonal=False,   # No Seasonality
               start_P=0, 
               D=1,  
               trace=True,
               error_action='ignore',  
               suppress_warnings=True, 
               stepwise=True)
    
    auto_arima_results.summary()
    
    Performing stepwise search to minimize aic
     ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=-300996.225, Time=7.82 sec
     ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=-30853.412, Time=2.62 sec
     ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=-272301.363, Time=3.90 sec
     ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=-116812.636, Time=10.22 sec
     ARIMA(0,1,0)(0,0,0)[0]             : AIC=-30855.181, Time=1.35 sec
     ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=-315512.530, Time=3.67 sec
     ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=-315504.206, Time=4.01 sec
     ARIMA(3,1,1)(0,0,0)[0] intercept   : AIC=-315506.573, Time=3.23 sec
     ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=-315513.680, Time=8.53 sec
     ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=-309847.293, Time=22.26 sec
     ARIMA(3,1,2)(0,0,0)[0] intercept   : AIC=-309557.825, Time=40.41 sec
     ARIMA(2,1,3)(0,0,0)[0] intercept   : AIC=-315534.509, Time=12.44 sec
     ARIMA(1,1,3)(0,0,0)[0] intercept   : AIC=-312901.486, Time=18.11 sec
     ARIMA(3,1,3)(0,0,0)[0] intercept   : AIC=-315657.922, Time=10.30 sec
     ARIMA(3,1,3)(0,0,0)[0]             : AIC=-315828.456, Time=23.75 sec
     ARIMA(2,1,3)(0,0,0)[0]             : AIC=-315536.495, Time=5.94 sec
     ARIMA(3,1,2)(0,0,0)[0]             : AIC=inf, Time=30.87 sec
     ARIMA(2,1,2)(0,0,0)[0]             : AIC=-315515.665, Time=4.10 sec
    
    Best model:  ARIMA(3,1,3)(0,0,0)[0]          
    Total fit time: 213.543 seconds
    
    Out[51]:
    SARIMAX Results
    Dep. Variable: y No. Observations: 65320
    Model: SARIMAX(3, 1, 3) Log Likelihood 157921.228
    Date: Fri, 29 Dec 2023 AIC -315828.456
    Time: 18:07:02 BIC -315764.847
    Sample: 0 HQIC -315808.775
    - 65320
    Covariance Type: opg
    coef std err z P>|z| [0.025 0.975]
    ar.L1 2.4725 0.021 116.934 0.000 2.431 2.514
    ar.L2 -2.0030 0.038 -52.033 0.000 -2.078 -1.928
    ar.L3 0.5260 0.018 29.575 0.000 0.491 0.561
    ma.L1 -0.8029 0.021 -37.572 0.000 -0.845 -0.761
    ma.L2 -0.0242 0.005 -5.296 0.000 -0.033 -0.015
    ma.L3 -0.0307 0.006 -5.555 0.000 -0.042 -0.020
    sigma2 0.0005 1.2e-06 384.413 0.000 0.000 0.000
    Ljung-Box (L1) (Q): 4.99 Jarque-Bera (JB): 230383.67
    Prob(Q): 0.03 Prob(JB): 0.00
    Heteroskedasticity (H): 1.00 Skew: -0.09
    Prob(H) (two-sided): 0.91 Kurtosis: 12.20


    Warnings:
    [1] Covariance matrix calculated using the outer product of gradients (complex-step).

    The analysis using the AutoARIMA model has provided insights into the time series data for average temperature (avg_temp). The selected model parameters for the SARIMAX model are (3,1,3)(0,0,0)[0], where:

    p (order of autoregressive term): 3 d (degree of differencing): 1 q (order of moving average term): 3 P (seasonal order of autoregressive term): 0 D (seasonal degree of differencing): 0 Q (seasonal order of moving average term): 0 m (seasonal frequency): 24 (daily seasonality) Here is a summary of the key findings:

    Model Fit: The SARIMAX model with these parameters achieves an impressive Log Likelihood of 157,921.228, indicating a strong fit to the data.

    Information Criteria: The Akaike Information Criterion (AIC) is -315,828.456, and the Bayesian Information Criterion (BIC) is -315,764.847. These values are significant improvements over alternative models, suggesting that this SARIMAX model is the best choice for forecasting temperature based on the AIC and BIC criteria.

    Model Coefficients: The model coefficients indicate significant autoregressive (AR) and moving average (MA) terms. The coefficients are within acceptable ranges, indicating stability and significance.

    Residuals: The Ljung-Box test (Q) indicates that there is no significant autocorrelation in the residuals, and the Jarque-Bera test (JB) suggests that the residuals are normally distributed. This confirms that the model has successfully captured the underlying patterns in the data.

    Heteroskedasticity and Skew: The model does not exhibit heteroskedasticity (significant H value), and the skewness is close to zero, indicating a well-behaved model.

    Conclusion: The SARIMAX model with the parameters (3,1,3)(0,0,0)[0] and a seasonal frequency of 24 provides an excellent fit to the time series data for average temperature. The model's ability to capture the data's underlying patterns is reflected in the high Log Likelihood and the absence of significant autocorrelation in the residuals. Additionally, both the AIC and BIC criteria support the selection of this model as the best predictor for temperature forecasting.

    This robust model can be used to make accurate temperature forecasts and gain valuable insights into temperature variations, which is crucial for various applications such as weather forecasting and climate analysis.

    The next step is the diagnostic phase, which visualizes the model's diagnostics. Visualizing the diagnostic results allows us to assess how well the SARIMAX model adapts to the temperature data, ensuring the reliability of its predictions. It can help evaluate its ability to capture data patterns and identify any discrepancies between observed and predicted values. Furthermore visualization helps verify whether the model's assumptions are met, such as the normal distribution of residuals and the absence of autocorrelation. By validating these assumptions, we ensure the model's robustness and reliability.

    In [52]:
    auto_arima_results.plot_diagnostics(figsize=(10,8))
    plt.show()
    

    Standardized Residuals:

    The top-left plot displays standardized residuals over time, representing the differences between observed values and the model's predictions, standardized by the standard deviation of the residuals. A model that fits well would show residuals scattered randomly around zero without discernible patterns. In the provided plot, no obvious trends or seasonality are observed in the residuals, suggesting that the model has adequately captured the underlying structure of the data.

    Histogram plus Estimated Density:

    The top-right plot features a histogram of the residuals with a Kernel Density Estimate (KDE) and a Normal distribution curve superimposed. The histogram illustrates the frequency distribution of the residuals, while the KDE provides a smooth estimate of this distribution. The closer the KDE and the histogram align with the normal curve, the more the residuals approximate a normal distribution. The plot indicates a close fit between the KDE and the Normal distribution curve, suggesting that the residuals are nearly normally distributed, which meets an assumption of the ARIMA model.

    Normal Q-Q Plot:

    The bottom-left plot, the Quantile-Quantile plot, compares the quantiles of the residuals to the quantiles of a Normal distribution. Points on the Q-Q plot should lie approximately on the 45-degree reference line if the residuals are normally distributed. Some deviation from the line at the extremes is evident in the Q-Q plot, indicating that the extreme values of the residuals may deviate from a perfect normal distribution. However, the majority of points do closely follow the reference line.

    Correlogram:

    The bottom-right plot, the correlogram, depicts the autocorrelation of the residuals at various lags. For a well-fitted model, the autocorrelations for all lag values are expected to lie within the confidence interval, denoting an absence of significant autocorrelation in the residuals. The correlogram shows all autocorrelations within the expected confidence bounds, implying that the residuals are free from autocorrelation, thus indicating that the model has effectively captured the time series dependencies. These diagnostic plots suggest that the SARIMAX model with the specified parameters is adept at fitting to the average temperature data. The residuals exhibit the expected behavior for a well-specified model, with no evident patterns, a distribution approaching normality, and insignificant autocorrelation. This supports the use of the model for forecasting average temperature, providing assurance in its predictive reliability.

    5.3.3.1.3 Temperature and Wind Modeling over Time - Fit Model (SARIMAX)

    In the next step, a SARIMAX model is fitted to the temperature and wind data over time. This model is chosen because it can incorporate external factors like wind speed and direction, which influence temperature. Additionally, SARIMAX can handle seasonal patterns often found in weather data. The training and test datasets are defined, exogenous variables (wind speed and direction) are specified, hyperparameters are set, and the SARIMAX model is fitted. Key information about the model's performance and parameters is provided in the model summary. This step is crucial for accurate temperature forecasting and understanding the temperature-wind relationship.

    In [53]:
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    
    #manually define train and test data
    train_df = df_temp_wind[df_temp_wind.index <= datetime(2022,12,15)]
    test_df = df_temp_wind[df_temp_wind.index > datetime(2022,12,15)]
    
    
    # Define exogenous variables
    exog_vars = ['avg_windspd', 'avg_winddir']
    
    # Set the frequency to hourly
    train_df= train_df.asfreq('1H') 
    #Delete missing values
    train_df = train_df.dropna(subset=exog_vars)
    
    # Set Hyper-Parameters
    p, d, q = 3, 1, 3
    P, D, Q = 0, 0, 0
    s = 24
    
    # Fit SARIMAX
    sarimax_model = SARIMAX(train_df['avg_temp'],
                    order=(p, d, q),
                    seasonal_order=(P, D, Q, s),
                    exog=train_df[exog_vars])
    
    sarimax_model_fit = sarimax_model.fit()
    
    # Summary
    print(sarimax_model_fit.summary())
    
    c:\Users\177119724\AppData\Local\Python\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:
    
    A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
    
    c:\Users\177119724\AppData\Local\Python\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:
    
    A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
    
    
                                   SARIMAX Results                                
    ==============================================================================
    Dep. Variable:               avg_temp   No. Observations:                65024
    Model:               SARIMAX(3, 1, 3)   Log Likelihood              157588.046
    Date:                Fri, 29 Dec 2023   AIC                        -315158.091
    Time:                        18:08:09   BIC                        -315076.349
    Sample:                             0   HQIC                       -315132.795
                                  - 65024                                         
    Covariance Type:                  opg                                         
    ===============================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
    -------------------------------------------------------------------------------
    avg_windspd     0.1454      0.004     36.921      0.000       0.138       0.153
    avg_winddir -3.687e-05   9.41e-06     -3.918      0.000   -5.53e-05   -1.84e-05
    ar.L1           2.4614      0.013    196.384      0.000       2.437       2.486
    ar.L2          -1.9693      0.023    -84.233      0.000      -2.015      -1.924
    ar.L3           0.5050      0.011     45.665      0.000       0.483       0.527
    ma.L1          -0.8106      0.013    -64.192      0.000      -0.835      -0.786
    ma.L2          -0.0463      0.004    -10.490      0.000      -0.055      -0.038
    ma.L3          -0.0673      0.006    -12.051      0.000      -0.078      -0.056
    sigma2          0.0005   1.27e-06    360.363      0.000       0.000       0.000
    ===================================================================================
    Ljung-Box (L1) (Q):                   6.34   Jarque-Bera (JB):            184518.11
    Prob(Q):                              0.01   Prob(JB):                         0.00
    Heteroskedasticity (H):               1.01   Skew:                            -0.06
    Prob(H) (two-sided):                  0.58   Kurtosis:                        11.25
    ===================================================================================
    
    Warnings:
    [1] Covariance matrix calculated using the outer product of gradients (complex-step).
    

    It achieved a Log Likelihood of 157,588.046, indicating a strong fit to the data. Both the AIC (-315,158.091) and BIC (-315,076.349) criteria support the selection of this SARIMAX model as the best predictor for temperature forecasting.

    5.3.3.1.4 Temperature and Wind Modeling over Time - Evaluation on test set (SARIMAX)

    Next, the SARIMAX model's performance will be evaluated on the test set. This evaluation is crucial to assess how well the model can make accurate temperature forecasts on unseen data.

    Generate forecasts for the test period using the SARIMAX model, matching the length of the test dataset.

    Plot both the actual average temperature values and the forecasted values to visually compare their trends. This evaluation will provide insights into the model's ability to predict temperature based on wind-related exogenous variables and its real-world forecasting performance.

    In [54]:
    # Get the forecast for the test period
    forecast_steps = len(test_df)
    forecast_index = test_df.index  # Use the time index of the test dataframe
    
    forecast = sarimax_model_fit.get_forecast(steps=forecast_steps, exog=test_df[exog_vars])
    
    # Plot the actual values
    plt.plot(test_df['avg_temp'], label='Actual')
    
    # Plot the forecasted values
    plt.plot(forecast_index, forecast.predicted_mean, label='Forecast', color='red')
    
    # Plot confidence intervals
    #plt.fill_between(forecast_index, forecast.conf_int()['lower avg_temp'], forecast.conf_int()['upper avg_temp'], color='pink', alpha=0.3)
    
    # Set labels and title
    plt.xlabel('Time')
    plt.ylabel('Average Temperature')
    plt.title('SARIMAX Forecast on Test Data')
    plt.legend()
    
    # Show the plot
    plt.show()
    plt.savefig('../output/Linear Regression Analysis Multiple Predictors SARIMAX Forecast Results.png')
    
    c:\Users\177119724\AppData\Local\Python\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning:
    
    No supported index is available. Prediction results will be given with an integer index beginning at `start`.
    
    
    <Figure size 2000x1500 with 0 Axes>

    The plot presents a comparison between actual and forecasted average temperature values, providing a visual representation of the SARIMAX model's forecasting performance. Here is an explanation of the plot elements in relation to the provided conclusion: The actual average temperatures, shown in blue, display significant short-term fluctuations characterized by sharp peaks and troughs, indicative of the complex variability inherent in the temperature data over the test period. The forecasted temperatures, depicted in red, show the model's attempt to predict these temperatures. The forecast line is noticeably smoother and fails to match the magnitude and frequency of the observed temperature variations, particularly missing the sharpness of the peaks. At the start, the model seems to approximate the initial peak in temperature but does not fully capture its height or the subsequent drop, which is indicative of the model's limitations in predicting sudden changes. As the plot progresses, the model's forecasted values consistently overestimate the actual temperatures, with the forecast line situated above the actual temperature line most of the time. This overestimation aligns with the conclusion that the model tends to predict higher temperatures than actually observed, suggesting a systematic bias in the model.

    The significant divergence between the forecasted and actual values, especially in the latter half of the plot where the model fails to predict the sharp rise and fall in temperatures, underscores the model's struggle to accurately capture the dynamic relationship between temperature and wind-related exogenous variables.

    The absence of confidence interval bands in the plot means the uncertainty in the forecasts is not visualized, which could have provided additional insights into the model's predictive performance.

    The graphic analysis reveals that the SARIMAX model struggles to accurately forecast the average temperature based on wind. While the initial peak is reasonably predicted, subsequent fluctuations in the data are not adequately captured. The forecast consistently exhibits a tendency to overestimate temperatures, with an average deviation of approximately 7 degrees higher than the observed values. This discrepancy suggests a systematic bias in the prediction, indicating that the SARIMAX model may not effectively account for certain influential factors related to wind-induced temperature variations. Consequently, the overall performance of the SARIMAX prediction falls short, highlighting the need for further refinement or exploration of alternative modeling approaches to enhance accuracy.
    Exploring the dataset more comprehensively and fine-tuning the model based on a thorough understanding of the data patterns may contribute to enhancing the SARIMAX model's performance. This involves adjusting hyperparameters and making refinements to better align with the specific characteristics of the data, ultimately leading to more accurate predictions.

    The next step involves quantitatively evaluating the SARIMAX model through key statistical metrics to measure accuracy and generalization capabilities. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) will be reviewed to assess model fit and complexity, while the Mean Squared Error (MSE) and Sum Squared Error (SSE) will quantify prediction error. These metrics are crucial for determining the model's forecasting efficacy and pinpointing areas that may require refinement.

    In [55]:
    # Get evaluation data
    sarimax_aic = sarimax_model_fit.aic
    sarimax_bic = sarimax_model_fit.bic
    sarimax_mean_squared_error = sarimax_model_fit.mse
    sarimax_sum_squared_error=sarimax_model_fit.sse
    
    print(f'Akaike information criterion | AIC: {sarimax_aic}')
    print(f'Bayesian information criterion | BIC: {sarimax_bic}')
    print(f'Mean Squared Error | MSE: {sarimax_mean_squared_error}')
    print(f'Sum Squared Error | SSE: {sarimax_sum_squared_error}')
    
    Akaike information criterion | AIC: -315158.0914554971
    Bayesian information criterion | BIC: -315076.3489885051
    Mean Squared Error | MSE: 1.2663238330051534
    Sum Squared Error | SSE: 82341.4409173271
    

    5.3.3.1.5 Temperature and Wind Modeling over Time - Save Model (SARIMAX)

    In [56]:
    # Create a meaningful name with a timestamp
    model_name = f"sarimax_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(sarimax_model_fit, f)
    
    print(f"The SARIMAX model has been saved to '{folder_path + model_name}'.")
    
    The SARIMAX model has been saved to '../models/sarimax_model_20231229_1808.pkl'.
    
    5.3.3.2 XGBOOST¶

    5.3.3.2.1 Temperature and Wind Modeling over Time - Select Model (Lazy Regressor)

    “Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.” ― Robert Heinlein

    Usually, an "educated guess" is beeing made when it comes to model selection. The developer chooses some models, which he (or she) considers the best model for the given use case. After implementing the SARIMAX approach the library LazyPredict will now be used to determine the best models.

    LazyPredict is a valuable tool for model selection, particularly when dealing with a large set of potential models and limited time for exhaustive evaluation. In the context of time series forecasting with exogenous variables, where various models such as SARIMAX, linear regression, and machine learning algorithms can be considered, LazyPredict streamlines the model comparison process by automatically assessing the performance of multiple models.
    It provides a quick overview of model performances without the need for extensive manual tuning. In this specific use case, where accurate temperature predictions are crucial, the metric of interestis the Root Mean Squared Error (RMSE), as these metrics directly quantify the predictive accuracy of the models. By leveraging LazyPredict, one can efficiently identify promising model candidates and subsequently fine-tune the selected models for optimal performance based on the specific metric deemed most important for the forecasting task.

    In [57]:
    #from lazypredict.Supervised import  LazyRegressor
    #from sklearn.model_selection import train_test_split
    
    ## Define predictor variables
    #X = df_temp_wind[['avg_windspd', 'avg_winddir','avg_windgust']]
    ## Define response variable
    #y = df_temp_wind['avg_temp']
    
    #X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=123)
    
    #reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
    #models,pred = reg.fit(X_train, X_test, y_train, y_test)
    
    #print(models)
    
    ## takes too long to run
    

    Results

    Model Adjusted R-Squared R-Squared RMSE Time Taken
    XGBRegressor 0.13 0.13 10.66 0.44
    HistGradientBoostingRegressor 0.12 0.12 10.72 0.80
    LGBMRegressor 0.12 0.12 10.72 0.17
    RandomForestRegressor 0.12 0.12 10.72 69.52
    ExtraTreesRegressor 0.10 0.10 10.83 14.12
    MLPRegressor 0.08 0.08 10.95 30.44
    GradientBoostingRegressor 0.08 0.08 10.95 14.94
    NuSVR 0.07 0.07 11.02 127.68
    SVR 0.06 0.06 11.09 129.99
    KNeighborsRegressor 0.05 0.05 11.11 0.11
    BaggingRegressor 0.05 0.05 11.13 7.32
    AdaBoostRegressor 0.03 0.03 11.25 1.34
    LassoCV 0.01 0.01 11.34 0.42
    LassoLarsCV 0.01 0.01 11.34 0.06
    Lars 0.01 0.01 11.34 0.17
    LarsCV 0.01 0.01 11.34 0.15
    TransformedTargetRegressor 0.01 0.01 11.34 0.03
    LassoLarsIC 0.01 0.01 11.34 0.04
    LinearRegression 0.01 0.01 11.34 0.02
    OrthogonalMatchingPursuitCV 0.01 0.01 11.34 0.05
    Ridge 0.01 0.01 11.34 0.02
    RidgeCV 0.01 0.01 11.34 0.03
    ElasticNetCV 0.01 0.01 11.34 0.30
    BayesianRidge 0.01 0.01 11.34 0.03
    PoissonRegressor 0.01 0.01 11.34 0.05
    SGDRegressor 0.01 0.01 11.34 0.07
    OrthogonalMatchingPursuit 0.01 0.01 11.35 0.02
    TweedieRegressor 0.01 0.01 11.35 0.02
    GammaRegressor 0.01 0.01 11.35 0.14
    ElasticNet 0.01 0.01 11.36 0.03
    LassoLars 0.01 0.01 11.39 0.02
    Lasso 0.01 0.01 11.39 0.04
    HuberRegressor 0.00 0.01 11.39 0.14
    LinearSVR 0.00 0.00 11.42 0.10
    DummyRegressor -0.00 -0.00 11.42 0.01
    PassiveAggressiveRegressor -0.20 -0.20 12.51 0.04
    RANSACRegressor -0.48 -0.47 13.87 0.13
    DecisionTreeRegressor -0.64 -0.64 14.64 0.86
    ExtraTreeRegressor -0.69 -0.68 14.82 0.15

    --> in the following steps the model with the lowest RMSE value, XGB Regressor, will be implemented.

    5.3.3.2.2 Temperature and Wind Modeling over Time - Training and Validation (XGB Regressor)

    In [58]:
    # Extract features and target
    features = ['avg_windspd', 'avg_winddir']
    target = 'avg_temp'
    
    X = df_temp_wind[features]
    y = df_temp_wind[target]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create time-related features
    def create_time_feature(data):
            data['dayofmonth'] = data.index.day
            data['dayofweek'] = data.index.dayofweek
            data['quarter'] = data.index.quarter
            data['month'] = data.index.month
            data['year'] = data.index.year
            data['dayofyear'] = data.index.dayofyear
            data['weekofyear'] = data.index.isocalendar().week
    
    create_time_feature(df_temp_wind)
    df_temp_wind.head()
    
    Out[58]:
    avg_temp avg_windspd avg_winddir avg_windgust dayofmonth dayofweek quarter month year dayofyear weekofyear
    run_datetime
    2015-07-15 00:00:00 287.389224 3.386380 80.302464 11.136197 15 2 3 7 2015 196 29
    2015-07-15 01:00:00 287.378997 3.326687 76.866373 11.002795 15 2 3 7 2015 196 29
    2015-07-15 02:00:00 287.388845 3.243494 76.258867 10.700595 15 2 3 7 2015 196 29
    2015-07-15 03:00:00 287.427324 3.145505 78.299616 10.323983 15 2 3 7 2015 196 29
    2015-07-15 04:00:00 287.489158 3.047607 84.632852 9.921157 15 2 3 7 2015 196 29

    5.3.3.2.3 Temperature and Wind Modeling over Time - Fit Model (XGB Regressor)

    In [59]:
    from sklearn.model_selection import TimeSeriesSplit
    import xgboost as xgb
    
    
    model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, max_depth=3, learning_rate=0.1, alpha=0.5)
    
    
    # Initialize time aware cross-validation
    tscv = TimeSeriesSplit(n_splits=5)
    
    # Perform cross-validation
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        model.fit(X_train, y_train)
    
    In [60]:
    test_predictions_index = X_test.index
    predictions = model.predict(X_test)
    
    predictions_df = pd.DataFrame(predictions, index=test_predictions_index, columns=['vg_temp_prediction'])
    

    5.3.3.2.4 Temperature and Wind Modeling over Time - Evaluation on Test set (XGB Regressor)

    In [61]:
    comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': predictions_df['vg_temp_prediction']})
    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(comparison_df['Actual'], comparison_df['Predicted'])
    print(f'Mean Squared Error: {mse}')
    
    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(comparison_df['Actual'], comparison_df['Predicted'])
    print(f'Mean Absolute Error: {mae}')
    
    Mean Squared Error: 125.31079889302593
    Mean Absolute Error: 9.367344482232422
    
    In [62]:
    # 'comparison_df' enthält die tatsächlichen und vorhergesagten Werte
    plt.figure(figsize=(12, 6))
    plt.plot(comparison_df.index, comparison_df['Actual'], label='Actual',)
    plt.plot(comparison_df.index, comparison_df['Predicted'], label='Predicted', )
    plt.xlabel('Zeit')
    plt.ylabel('vg_temp')
    plt.title('Predictions vs. Actual Values')
    plt.legend()
    plt.show()
    

    5.3.3.2.5 Temperature and Wind Modeling over Time - Save Model (XGB Regressor)

    In [63]:
    # Create a meaningful name with a timestamp
    model_name = f"xgb_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(model, f)
    
    print(f"The SARIMAX model has been saved to '{folder_path + model_name}'.")
    
    The SARIMAX model has been saved to '../models/xgb_model_20231229_1808.pkl'.
    

    5.3.4. Temperature Prediction (Temporal):¶

    Temperature prediction involves exploring the relationships between temperature and time (daytime, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well the prediction of the temperature for the next x days, weeks, etc. based on the historical weather.

    Is it possible to build an accurate regression model to predict temperature based on historical data? This involves exploring the relationships between temperature and windspeed over time (daytime, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well the prediction of the temperature for the next x days, weeks, etc. based on the historical weather.

    In [64]:
    # Cloning DataFrame to manipulate it for the forecast prediction
    forecast_data = pd.DataFrame(df)
    
    # Convert timestamp to datetime format
    forecast_data['timestamp'] = pd.to_datetime(forecast_data['run_datetime'])
    
    # Extract day of the year and hour of the day
    forecast_data['day_of_year'] = forecast_data['timestamp'].dt.dayofyear
    forecast_data['hour'] = forecast_data['timestamp'].dt.hour
    
    # Assuming 'avg_temp' is the target variable
    X = forecast_data[['day_of_year', 'hour']]
    y = forecast_data['avg_temp']
    
    5.3.4.1 Temperature Prediction (Temporal): Select model¶
    In [65]:
    # Create models
    linear_regression_model = make_pipeline(LinearRegression())
    gradient_boosting_model = GradientBoostingRegressor()
    sgd_regressor_model = make_pipeline(StandardScaler(), SGDRegressor())
    svr_model = SVR()
    
    5.3.4.2 Temperature Prediction (Temporal): Training and validation¶
    In [66]:
    # Split the data into training and testing sets
    train_mask = (forecast_data['timestamp'].dt.year >= 2015) & (forecast_data['timestamp'].dt.year <= 2021)
    test_mask = (forecast_data['timestamp'].dt.year == 2022)
    
    X_train, X_test = X[train_mask], X[test_mask]
    y_train, y_test = y[train_mask], y[test_mask]
    
    5.3.4.3 Temperature Prediction (Temporal): Fit model¶
    In [67]:
    # Train models
    linear_regression_model_fit = linear_regression_model.fit(X_train, y_train)
    gradient_boosting_model_fit = gradient_boosting_model.fit(X_train, y_train)
    sgd_regressor_model_fit = sgd_regressor_model.fit(X_train, y_train)
    svr_model_fit = svr_model.fit(X_train, y_train)
    
    5.3.4.4 Temperature Prediction (Temporal): Evaluation on test set¶
    In [68]:
    # Make predictions on the test set
    y_pred_linear_regression = linear_regression_model_fit.predict(X_test)
    y_pred_gradient_boosting = gradient_boosting_model_fit.predict(X_test)
    y_pred_sgd_regressor = sgd_regressor_model_fit.predict(X_test)
    y_pred_svr = svr_model_fit.predict(X_test)
    
    # Calculate residuals
    residuals_linear_regression = y_test - y_pred_linear_regression
    residuals_gradient_boosting = y_test - y_pred_gradient_boosting
    residuals_sgd_regressor = y_test - y_pred_sgd_regressor
    residuals_svr = y_test - y_pred_svr
    
    5.3.4.5 Temperature Prediction (Temporal): Visualize results¶
    In [69]:
    # Calculate MSE, MAE, and Mean Residual Error for each model
    mse_linear_regression = mean_squared_error(y_test, y_pred_linear_regression)
    mae_linear_regression = mean_absolute_error(y_test, y_pred_linear_regression)
    mean_residual_error_linear_regression = np.mean(residuals_linear_regression)
    
    mse_gradient_boosting = mean_squared_error(y_test, y_pred_gradient_boosting)
    mae_gradient_boosting = mean_absolute_error(y_test, y_pred_gradient_boosting)
    mean_residual_error_gradient_boosting = np.mean(residuals_gradient_boosting)
    
    mse_sgd_regressor = mean_squared_error(y_test, y_pred_sgd_regressor)
    mae_sgd_regressor = mean_absolute_error(y_test, y_pred_sgd_regressor)
    mean_residual_error_sgd_regressor = np.mean(residuals_sgd_regressor)
    
    mse_svr = mean_squared_error(y_test, y_pred_svr)
    mae_svr = mean_absolute_error(y_test, y_pred_svr)
    mean_residual_error_svr = np.mean(residuals_svr)
    
    # Visualize the predictions, residuals, and tolerance zone
    plt.figure(figsize=(15, 20))
    
    # Linear Regression subplot
    plt.subplot(4, 2, 1)
    plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
    plt.plot(forecast_data[test_mask]['timestamp'], y_pred_linear_regression, label=f'Linear Regression Predicted\nMSE: {mse_linear_regression:.2f}, MAE: {mae_linear_regression:.2f}, Mean Residual: {mean_residual_error_linear_regression:.2f}')
    plt.xlabel('Timestamp')
    plt.ylabel('Temperature')
    plt.title('Linear Regression')
    plt.legend()
    
    # Linear Regression Residuals subplot
    plt.subplot(4, 2, 2)
    plt.scatter(y_pred_linear_regression, residuals_linear_regression)
    plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
    plt.xlabel('Predicted Temperature')
    plt.ylabel('Residuals')
    plt.title('Linear Regression Residuals')
    plt.legend()
    
    # Gradient Boosting Regressor subplot
    plt.subplot(4, 2, 3)
    plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
    plt.plot(forecast_data[test_mask]['timestamp'], y_pred_gradient_boosting, label=f'Gradient Boosting Predicted\nMSE: {mse_gradient_boosting:.2f}, MAE: {mae_gradient_boosting:.2f}, Mean Residual: {mean_residual_error_gradient_boosting:.2f}')
    plt.xlabel('Timestamp')
    plt.ylabel('Temperature')
    plt.title('Gradient Boosting Regressor')
    plt.legend()
    
    # Gradient Boosting Residuals subplot
    plt.subplot(4, 2, 4)
    plt.scatter(y_pred_gradient_boosting, residuals_gradient_boosting)
    plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
    plt.xlabel('Predicted Temperature')
    plt.ylabel('Residuals')
    plt.title('Gradient Boosting Residuals')
    plt.legend()
    
    # SGD Regressor subplot
    plt.subplot(4, 2, 5)
    plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
    plt.plot(forecast_data[test_mask]['timestamp'], y_pred_sgd_regressor, label=f'SGD Regressor Predicted\nMSE: {mse_sgd_regressor:.2f}, MAE: {mae_sgd_regressor:.2f}, Mean Residual: {mean_residual_error_sgd_regressor:.2f}')
    plt.xlabel('Timestamp')
    plt.ylabel('Temperature')
    plt.title('SGD Regressor')
    plt.legend()
    
    # SGD Regressor Residuals subplot
    plt.subplot(4, 2, 6)
    plt.scatter(y_pred_sgd_regressor, residuals_sgd_regressor)
    plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
    plt.xlabel('Predicted Temperature')
    plt.ylabel('Residuals')
    plt.title('SGD Regressor Residuals')
    plt.legend()
    
    # SVR subplot
    plt.subplot(4, 2, 7)
    plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
    plt.plot(forecast_data[test_mask]['timestamp'], y_pred_svr, label=f'SVR Predicted\nMSE: {mse_svr:.2f}, MAE: {mae_svr:.2f}, Mean Residual: {mean_residual_error_svr:.2f}')
    plt.xlabel('Timestamp')
    plt.ylabel('Temperature')
    plt.title('Support Vector Regressor')
    plt.legend()
    
    # SVR Residuals subplot
    plt.subplot(4, 2, 8)
    plt.scatter(y_pred_svr, residuals_svr)
    plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
    plt.xlabel('Predicted Temperature')
    plt.ylabel('Residuals')
    plt.title('Support Vector Regressor Residuals')
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    plt.savefig('../output/Linear Regression Analysis Prediction Forecast Results.png')
    
    <Figure size 2000x1500 with 0 Axes>

    The results proof that an approximate forecast of the temperature for one year in the region of Bancroft in Canada is very well possible. While an exact prediction of the weather for every day is impossible to achieve so far, the scope was to predict the approximate development of the temperature over the year using linear regression models, which was possible here with a satisfying accuracy.

    5.3.4.6 Temperature Prediction (Temporal): Save model¶
    In [70]:
    # Create a meaningful name with a timestamp
    model_name_lr_3 = f"lr_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_gb_3 = f"gb_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_sgd_3 = f"sgd_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_svr_3 = f"svr_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name_lr_3, 'wb') as f:
        pickle.dump(linear_regression_model_fit, f)
    with open(folder_path + model_name_gb_3, 'wb') as f:
        pickle.dump(gradient_boosting_model_fit, f)
    with open(folder_path + model_name_sgd_3, 'wb') as f:
        pickle.dump(sgd_regressor_model_fit, f)
    with open(folder_path + model_name_svr_3, 'wb') as f:
        pickle.dump(svr_model_fit, f)
    

    5.3.5 Predicting Extreme Weather Events by Temperature (Logistic Regression)¶

    While logistic regression uses mathematical regression techniques, the result is in fact a binary classification. For this reason the logistic regression analysis is conducted in between the linear regression and classification chapters, to "build a bridge" between the topics, which helps getting a better understanding of the told data story.

    The analysis for extreme weather event prediction by temperature is conducted using logistic regression techniques. It is used to answer the research question: Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or alternatively windspeed) ranges? This question involves categorizing temperature into distinct classes and determining whether certain temperature ranges are indicative of extreme weather events. The outcome of this analysis will contribute to whether an observation is an extreme weather event, providing valuable insights into the correlation between temperature and specific meteorological phenomena. For example, it could be assumed, that extremely cold temperatures are more common in extreme weather events. Whether this statement is true, is to be analyzed in this chapter. To achieve this, this tasks also includes to find out, what the critical temperature thresholds associated with the onset or intensification of different weather events are. This question aims to identify temperature breakpoints that signify transitions between different weather conditions. Understanding these thresholds is essential for developing early warning systems and enhancing preparedness for weather-related events.

    In [71]:
    label1_counts = df['label1'].value_counts()
    print(label1_counts)
    
    label1
    1    52633
    0    12712
    Name: count, dtype: int64
    
    5.3.5.1 Predicting Extreme Weather Events by Temperature (Logistic Regression) - Select Model¶
    In [72]:
    # Create Logistic Regressuin Model
    logreg_model = LogisticRegression()
    
    5.3.5.2 Predicting Extreme Weather Events by Temperature (Logistic Regression) - Training and Validation¶
    In [73]:
    # Select the features (independent variable)
    X = df[['avg_temp']]
    
    # Select the dependent variable
    y = df['label1']
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Standardize the features (reshape to 2D array)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train.values.reshape(-1, 1))
    X_test_scaled = scaler.transform(X_test.values.reshape(-1, 1))
    
    # Print the shapes of the sets
    print("X_train shape:", X_train_scaled.shape)
    print("X_test shape:", X_test_scaled.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
    
    X_train shape: (52276, 1)
    X_test shape: (13069, 1)
    y_train shape: (52276,)
    y_test shape: (13069,)
    
    5.3.5.3 Predicting Extreme Weather Events by Temperature (Logistic Regression) - Fit Model¶
    In [74]:
    # Fit Logistic Regression model
    logreg_model.fit(X_train_scaled, y_train)
    
    # Add a constant to the features for statsmodels
    X_train_sm = sm.add_constant(X_train_scaled)
    
    # Create a logistic regression model using statsmodels
    logit_model_sm = sm.Logit(y_train, X_train_sm)
    
    # Fit the logit sm  model
    logit_model = logit_model_sm.fit()
    
    Optimization terminated successfully.
             Current function value: 0.375341
             Iterations 7
    

    5.3.5.4 Predicting Extreme Weather Events by Temperature (Logistic Regression) - Evaluation on test set¶

    In [75]:
    from yellowbrick.classifier import ConfusionMatrix, ClassificationReport
    
    # Print AIC
    aic_value = logit_model.aic
    print("AIC:", aic_value)
    print("Label 0: Extreme Weather Event \n Label 1: Blue Sky Day")
    
    # Make predictions on the test set using the result object
    X_test_sm = sm.add_constant(X_test_scaled)
    y_prob = logit_model.predict(X_test_sm)  # Probabilities
    y_pred = (y_prob > 0.5).astype(int)  # Convert probabilities to binary predictions
    
    # Confusion Matrix as Diagram with custom colors
    cm_viz = ConfusionMatrix(logreg_model, classes=[0, 1], cmap='RdYlGn')
    cm_viz.score(X_test_scaled, y_test)
    cm_viz.show()
    plt.show()
    plt.savefig('../output/Logistic Regression Predicting WEP by Temperature Confussion Matrix.png')
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    AIC: 39246.639678249965
    Label 0: Extreme Weather Event 
     Label 1: Blue Sky Day
    
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.39      0.20      0.26      2515
               1       0.83      0.93      0.88     10554
    
        accuracy                           0.79     13069
       macro avg       0.61      0.56      0.57     13069
    weighted avg       0.75      0.79      0.76     13069
    
    
    <Figure size 800x550 with 0 Axes>

    The results show, that using logistic regression analysis with only the temperature variable as predictor, does not deliver good results. Most of the extreme weather events were labelled as blue sky day events, which leads to a high false positive rate and a low recall. This would be especially dangerous in a real world implementation, as the risk of not detecting a potentially dangerous extreme weather event is high. While most of the blue sky day events (label 1) were indeed labelled as such, false negative cases are not as hazardous. The analysis needs extensive improvement as discussed in the following, after visualizing the results in a logistic regression graph for better understanding.

    5.3.5.5 Predicting WEP by Temperature (Logistic Regression) - Visualization¶
    In [76]:
    # Visualize results
    plt.figure(figsize=(12, 6))
    
    # Plot the data points on the logistic regression curve based on probabilities
    y_vals_prob = logit_model.predict(sm.add_constant(scaler.transform(X_test.values.reshape(-1, 1))))
    plt.scatter(X_test, y_test, color='blue', label='Actual Labels, yi')
    plt.scatter(X_test, y_vals_prob, color='red', marker='o', label='Predicted Probabilities, p(Xi)')
    
    # Plot the linear part of the logistic regression curve
    x_vals = np.linspace(X_test.min(), X_test.max(), 100)
    linear_combination = logit_model.params[0] + logit_model.params[1] * scaler.transform(x_vals.reshape(-1, 1))
    
    # Logistic function (sigmoid) to convert linear combination to probability
    logistic_function = 1 / (1 + np.exp(-linear_combination))
    
    plt.plot(x_vals, linear_combination, color='green', linewidth=2, label='Logit, f(x)')
    plt.plot(x_vals, logistic_function, color='orange', linewidth=2, label='Logistic Regression Curve, p(x)')
    
    # Plot the threshold line at y=0.5
    plt.axhline(y=0.5, color='purple', linestyle='--', label='Threshold = 0.5')
    
    # Set y-axis limits
    plt.ylim(-0.01, 1.01)
    
    # Display AIC value
    plt.text(plt.xlim()[1], plt.ylim()[1], f'AIC = {aic_value:.2f}', ha='right', va='top', fontsize=10, color='purple')
    
    plt.xlabel('Temperature')
    plt.ylabel('Probability of Extreme Event')
    plt.suptitle('Logistic Regression Analysis')
    plt.title('Label 0: Extreme Weather Event \n Label 1: Blue Sky Day')
    plt.legend()
    plt.show()
    plt.savefig('../output/Logistic Regression Analysis Predicting WEP by Temperature Results.png')
    
    <Figure size 800x550 with 0 Axes>

    Improving the performance of the logistic regression model:

    • Performing a multiple predictor analysis Techniques like backward elimination and forward selection can be beneficial for improving the performance of the logistic regression model. These techniques help in selecting the most relevant features and refining the model by iteratively adding or removing predictors.
      • Backward Elimination: Start with a model that includes all predictors. Iteratively remove the least significant predictor (based on p-values or other criteria) until the model performance is satisfactory or until a stopping criterion is met.
      • Forward Selection: Start with a model that includes only one predictor. Iteratively add the most significant predictor (based on p-values or other criteria) until the model performance is satisfactory or until a stopping criterion is met.
    • Hyperparameter Tuning: Experimenting with different hyperparameter values for the logistic regression model might increase the performance. Ad Grid search or random search can help find the optimal hyperparameters.
    • Regularization: Adjusting the regularization strength (e.g., using C parameter) to control overfitting might increase the performance. Furthermore it could be experimented with both L1 (Lasso) and L2 (Ridge) regularization. Also ajdusting the classification threshold might help.

    As seen in the diagramm, observations for both extreme weather and blue sky day events are distributed equally on the temperature scale, both in high and low temperatures. Thus it's not possible to easily predict extreme weather or blue sky events by temperature using a linear or logistic predictor. While the above mentioned methods could improve the results, logistic regression seems to be an inappropiate tool for the binary classification of extreme weather events. Better suited methods are discussed and applied in the following chapters.

    5.3.5.6 Predicting WEP by Temperature (Logistic Regression) - Save Model¶
    In [77]:
    # Create a meaningful name with a timestamp
    model_name_logreg= f"logreg_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name_logreg, 'wb') as f:
        pickle.dump(logreg_model, f)
    

    5.4 Classification Analysis¶

    Research Questions:

    • Extreme Weather Events: Extreme weather events such as stroms or snowfall should be predictet. This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.

    • Weather Event and Pattern Classification: Prediction of different weather patterns based on multivariate weather data. This involves using multiclass classification algorithms. The results of this classification analysis is the predection of certain weather events based on the current weather data and a model that was trained on historical weather data.

      Predicting extreme weather events is paramount for safeguarding public safety, protecting infrastructure, and ensuring food security. Early warnings enable timely evacuations and preparations, reducing the impact on lives and minimizing damage to critical structures. Additionally, forecasting various weather conditions beyond extremes is vital for everyday planning, energy management, transportation, and sustainable resource use. Accurate weather predictions contribute to resilient communities, supporting both immediate disaster response and long-term planning for a wide range of societal needs.

    5.4.1. Binary Classification¶

    Extreme Weather Events in Binary Classification: Is it possible to classify and predict extreme weather events such as storms? This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.

    In the first step, a binary classification of extreme weather events versus blue sky weather is conducted.

    Blue sky Events can be identified by the variable "label1". The variable has the following manifestations:

    • 0 : Extreme Weather
    • 1: Blue Sky
    In [108]:
    # show data as pie chart
    labels = ['Blue Sky', 'Extreme Weather']
    sizes = df['label1'].value_counts()
    colors = [ 'lightskyblue','lightcoral']
    explode = (0.1, 0)  # explode first
    
    
    # Plotting the pie chart
    plt.figure(figsize=(6, 6))
    plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct=lambda p: '{:.0f} ({:.1f}%)'.format(p * sum(sizes) / 100, p),
            shadow=True, startangle=140)
    plt.axis('equal')
    plt.title('Blue Sky Events vs. Extreme Weather Events')
    plt.show()
    plt.savefig('../output/Binary Classification Blue Sky vs Extreme Weather Events.png')
    
    <Figure size 800x550 with 0 Axes>

    First, the used data will be preprocessed:

    In [109]:
    #copy df
    df_class =df
    
    # Calculate the mean value of the column
    mean_value = df_class["avg_pressure_change"].mean()
    
    #set "avg_temp_change" to 0 in the first row, as there is no temp change
    df_class.at[0, 'avg_temp_change'] = 0
    
    #drop columns, which are not numeric and not needed
    df_class = df_class.drop(columns = ['wep','wind_direction_label', 'run_datetime','label2','avg_pressure_change','label0', 'display_label'])
    
    #show adjusted df
    df_class.head()
    
    Out[109]:
    avg_temp avg_temp_celsius min_wet_bulb_temp avg_dewpoint avg_temp_change avg_windspd max_windgust avg_winddir avg_winddir_sin avg_winddir_cos max_cumulative_precip max_snow_density_6 max_cumulative_snow max_cumulative_ice label1
    0 287.389224 14.239224 280.809506 280.735246 0.000000 3.386380 14.899891 80.302464 -0.981653 0.190676 2.009 0.0 0.0 0.0 1
    1 287.378997 14.228997 280.809506 280.414058 -0.010227 3.326687 14.899891 76.866373 0.994736 0.102466 1.209 0.0 0.0 0.0 1
    2 287.388845 14.238845 280.809506 280.187074 0.009848 3.243494 14.899891 76.258867 0.758262 0.651950 0.400 0.0 0.0 0.0 1
    3 287.427324 14.277324 280.809506 280.049330 0.038479 3.145505 14.899891 78.299616 0.237897 -0.971290 0.000 0.0 0.0 0.0 1
    4 287.489158 14.339158 280.809506 279.980697 0.061834 3.047607 14.702229 84.632852 0.189006 -0.981976 0.000 0.0 0.0 0.0 1
    In [110]:
    df_class["label1"]
    
    Out[110]:
    0        1
    1        1
    2        1
    3        1
    4        1
            ..
    65340    0
    65341    1
    65342    1
    65343    1
    65344    1
    Name: label1, Length: 65345, dtype: int64
    In [111]:
    # checking for missing values in the data
    df_class.isnull().sum()
    
    Out[111]:
    avg_temp                 0
    avg_temp_celsius         0
    min_wet_bulb_temp        0
    avg_dewpoint             0
    avg_temp_change          0
    avg_windspd              0
    max_windgust             0
    avg_winddir              0
    avg_winddir_sin          0
    avg_winddir_cos          0
    max_cumulative_precip    0
    max_snow_density_6       0
    max_cumulative_snow      0
    max_cumulative_ice       0
    label1                   0
    dtype: int64

    5.4.1.1.1 Binary Classification - Select Model

    Choose Classifier Models with Lazy Classifier:

    In [112]:
    #from lazypredict.Supervised import LazyClassifier
    
    ## Define predictor variables
    X = df_class.drop(columns = ['label1', 'max_snow_density_6'])
    
    ## Define response variable
    y = df_class['label1']
    
    #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
    
    #clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)
    #models, pred = clf.fit(X_train, X_test, y_train, y_test)
    
    #print(models)
    
    ## take too long to run
    

    Results

    Model Accuracy Balanced Accuracy ROC AUC F1 Score Time Taken
    ExtraTreesClassifier 0.99 0.99 0.99 0.99 4.13
    XGBClassifier 0.99 0.99 0.99 0.99 0.63
    RandomForestClassifier 0.99 0.99 0.99 0.99 29.54
    LGBMClassifier 0.99 0.99 0.99 0.99 0.67
    BaggingClassifier 0.99 0.98 0.98 0.99 11.56
    DecisionTreeClassifier 0.99 0.98 0.98 0.99 1.54
    AdaBoostClassifier 0.99 0.98 0.98 0.99 14.65
    LinearSVC 0.99 0.97 0.97 0.99 1.62
    CalibratedClassifierCV 0.99 0.97 0.97 0.99 1.48
    LogisticRegression 0.98 0.97 0.97 0.98 0.43
    QuadraticDiscriminantAnalysis 0.98 0.97 0.97 0.98 0.24
    GaussianNB 0.97 0.97 0.97 0.97 0.11
    SVC 0.99 0.97 0.97 0.99 22.36
    SGDClassifier 0.99 0.97 0.97 0.99 0.16
    ExtraTreeClassifier 0.97 0.96 0.96 0.97 0.12
    Perceptron 0.98 0.95 0.95 0.98 0.20
    PassiveAggressiveClassifier 0.98 0.95 0.95 0.98 0.23
    KNeighborsClassifier 0.97 0.92 0.92 0.96 7.63
    BernoulliNB 0.89 0.92 0.92 0.90 0.14
    NearestCentroid 0.83 0.86 0.86 0.85 0.13
    LinearDiscriminantAnalysis 0.91 0.76 0.76 0.90 0.37
    RidgeClassifier 0.89 0.72 0.72 0.87 0.11
    RidgeClassifierCV 0.89 0.72 0.72 0.87 0.13
    DummyClassifier 0.81 0.50 0.50 0.72 0.06

    The LazyClassifier indicates that "ExtraTreesClassifier," "XGBClassifier," "RandomForestClassifier," and "LGBMClassifier" are the top-performing classifiers for the given use case. Consequently, these classifiers will be implemented in the subsequent steps of the project. Their selection is based on their demonstrated effectiveness in handling the specific characteristics and requirements of the task at hand, as identified by the LazyClassifier's assessment. This strategic choice aims to leverage the strengths of these classifiers to achieve optimal performance and reliable results in the context of the project.

    5.4.1.1.2 Binary Classification - Training and Validation

    In [121]:
    # Split the dataset into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize the classifiers
    classifiers = {
        'ExtraTrees': ExtraTreesClassifier(),
        'XGBoost': XGBClassifier(),
        'LGBM': LGBMClassifier(verbose=-1),
        'RandomForest': RandomForestClassifier()
    }
    

    5.4.1.1.3 Binary Classification - Fit Model and Evaluate on Test Set

    In [122]:
    # Train the classifiers and collect accuracies
    accuracies = {}
    
    for idx, (name, clf) in enumerate(classifiers.items()):
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        accuracies[name] = accuracy
        print(f'{name} Accuracy: {accuracy}')
    
    ExtraTrees Accuracy: 0.9931134746346316
    XGBoost Accuracy: 0.9941847119136888
    LGBM Accuracy: 0.9935725763256561
    RandomForest Accuracy: 0.9935725763256561
    

    5.4.1.1.4 Binary Classification - Print Results

    In [123]:
    # Set up subplots
    fig, axes = plt.subplots(1, len(classifiers), figsize=(18, 5))
    
    # Iterate over classifiers and plot confusion matrices
    for ax, (name, model) in zip(axes, classifiers.items()):
        # Create a pipeline with StandardScaler and the current model
        pipe = make_pipeline(StandardScaler(), model)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
    
        # Calculate confusion matrix
        cm = confusion_matrix(y_test, y_pred)
    
        # Plot heatmap
        sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square=True, cmap='Blues', ax=ax)
    
        # Set labels and title
        ax.set_xlabel('Predicted Label')
        ax.set_ylabel('True Label')
        ax.set_title(f'{name} Confusion Matrix')
    
    # Adjust layout for better spacing
    plt.tight_layout()
    
    # Show the plot
    plt.show()
    plt.savefig('../output/Binary Classification Confussion Matrix.png')
    
    <Figure size 800x550 with 0 Axes>

    It can be observed that in all models, the majority of the data has been correctly classified. Notably, the LGBM classifier exhibits the fewest instances of falsely classifying data as negative (Type II Error), meaning it misses the fewest events. Specifically, in terms of the data, this implies that the LGBM classifier wrongly categorizes the fewest blue sky events as extreme weather. On the other hand, the XGBoost Classifier has the lowest false positive score (Type I Error), indicating it most frequently issues false alarms. In the present data, this error is more critical, as it means predicting no severe weather when it is, in fact, occurring.

    In [124]:
    # Iterate over classifiers and plot confusion matrices
    for ax, (name, model) in zip(axes, classifiers.items()):
        # Create a pipeline with StandardScaler and the current model
        pipe = make_pipeline(StandardScaler(), model)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
    
        # Calculate classification report with more decimal places
        report = classification_report(y_test, y_pred, digits=4)
    
        print(name, ":\n", report)
    
    ExtraTrees :
                   precision    recall  f1-score   support
    
               0     0.9911    0.9742    0.9826      2515
               1     0.9939    0.9979    0.9959     10554
    
        accuracy                         0.9933     13069
       macro avg     0.9925    0.9860    0.9892     13069
    weighted avg     0.9933    0.9933    0.9933     13069
    
    XGBoost :
                   precision    recall  f1-score   support
    
               0     0.9931    0.9753    0.9842      2515
               1     0.9942    0.9984    0.9963     10554
    
        accuracy                         0.9940     13069
       macro avg     0.9936    0.9869    0.9902     13069
    weighted avg     0.9940    0.9940    0.9939     13069
    
    LGBM :
                   precision    recall  f1-score   support
    
               0     0.9947    0.9726    0.9835      2515
               1     0.9935    0.9988    0.9961     10554
    
        accuracy                         0.9937     13069
       macro avg     0.9941    0.9857    0.9898     13069
    weighted avg     0.9937    0.9937    0.9937     13069
    
    RandomForest :
                   precision    recall  f1-score   support
    
               0     0.9947    0.9714    0.9829      2515
               1     0.9932    0.9988    0.9960     10554
    
        accuracy                         0.9935     13069
       macro avg     0.9940    0.9851    0.9894     13069
    weighted avg     0.9935    0.9935    0.9935     13069
    
    

    The classification reports provide an evaluation of the performance of different models.

    ExtraTrees: The ExtraTrees model demonstrates high precision and recall for both classes, achieving an accuracy of 99.30%. The precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1) are consistently high, indicating robust performance across both classes.

    XGBoost: The XGBoost model exhibits excellent precision, recall, and F1-score for both classes, resulting in an overall accuracy of 99.40%. Similar to ExtraTrees, it shows strong performance in correctly classifying both extreme weather and blue sky events.

    LightGBM: The LightGBM model achieves a high accuracy of 99.37%, with impressive precision, recall, and F1-score for both classes. Notably, it maintains a high recall for extreme weather events (0), ensuring that a significant proportion of these events are correctly identified.

    RandomForest: The RandomForest model performs well, achieving an accuracy of 99.32%. It shows strong precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1), indicating reliable performance across different weather scenarios.

    In summary, all four models—ExtraTrees, XGBoost, LightGBM, and RandomForest—demonstrate robust performance in classifying weather events, with high accuracy and consistent precision and recall metrics across the evaluated classes.

    5.4.1.1.5 Binary Classification - Save Model

    In [90]:
    for ax, (name, model) in zip(axes, classifiers.items()):
    
        model_name= f"{name}_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
        # Specify the folder path
        folder_path = "../models/"
    
        # Save the model to the specified folder with the created name
        with open(folder_path + model_name, 'wb') as f:
            pickle.dump(model, f)
    

    5.4.2 Multiclass Classification¶

    After successfully predicting extreme weather and blue sky day weather events, a key result of this research should be the prediction of specific extreme weather events. Once it is determined, that an observation is an extreme weather event, it's important to analyse what specific kind of extreme weather event it is. These results can then be used by scientists and governmental institutions to take countermeasures to prevent damage and minimize the risk for a weather event to be hazardous. The analysis for the classification of specific weather events and patterns is conducted using multiclass classification techniques. The research question to be answered ist: Is it possible to categorize and predict different extreme weather events based on multivariate weather data? This involves using multiclass classification algorithms. The results of this classification analysis is the prediction of certain weather events based on the current weather data and a model that was trained on historical weather data.

    In [91]:
    # Creating new dataframe that only includes extreme weather observations
    df_extreme_weather = pd.DataFrame(df)
    df_extreme_weather = df_extreme_weather[df_extreme_weather['label1'] != 1]
    
    # Dropping NaN
    df_extreme_weather = df_extreme_weather.dropna()
    df_extreme_weather
    
    Out[91]:
    run_datetime wep avg_temp avg_temp_celsius min_wet_bulb_temp avg_dewpoint avg_temp_change avg_windspd max_windgust avg_winddir ... wind_direction_label max_cumulative_precip max_snow_density_6 max_cumulative_snow max_cumulative_ice avg_pressure_change label0 label1 label2 display_label
    2238 2015-10-16 13:00:00 Moderate rain 276.935854 3.785854 271.085195 273.479160 -0.196776 2.610431 12.095491 302.159235 ... Northwest 6.115 0.000000 24.979 0.0 25.754899 4 0 3.0 Moderate rain
    2239 2015-10-16 14:00:00 Moderate rain 276.763328 3.613328 271.085195 273.237344 -0.172526 2.657227 12.095491 302.695638 ... Northwest 6.215 0.000000 25.897 0.0 26.257896 4 0 3.0 Moderate rain
    2240 2015-10-16 15:00:00 Moderate rain 276.597296 3.447296 271.085195 272.988734 -0.166032 2.700702 12.095491 303.377860 ... Northwest 6.315 0.000000 26.591 0.0 25.630187 4 0 3.0 Moderate rain
    2241 2015-10-16 16:00:00 Moderate rain 276.430385 3.280385 271.085195 272.769039 -0.166911 2.743809 12.095491 304.563206 ... Northwest 6.415 0.000000 27.208 0.0 24.500756 4 0 3.0 Moderate rain
    2242 2015-10-16 17:00:00 Moderate rain 276.242514 3.092514 271.085195 272.543725 -0.187871 2.778891 12.095491 306.064377 ... Northwest 6.515 0.000000 27.471 0.0 25.055266 4 0 3.0 Moderate rain
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    56604 2021-12-28 04:00:00 Mild snowfall 270.414644 -2.735356 263.582322 267.519393 0.158137 2.467546 16.219425 214.752551 ... Southwest 2.007 53.587745 39.138 0.0 20.613530 5 0 5.0 Mild snowfall
    56605 2021-12-28 05:00:00 Mild snowfall 270.540634 -2.609366 263.824712 267.697892 0.125990 2.431944 16.219425 211.899440 ... Southwest 1.807 53.587745 38.076 0.0 21.978703 5 0 5.0 Mild snowfall
    56606 2021-12-28 06:00:00 Mild snowfall 270.638388 -2.511612 264.198556 267.826681 0.097754 2.412873 16.219425 209.410100 ... Southwest 1.509 53.587745 32.855 0.0 23.830981 5 0 5.0 Mild snowfall
    56607 2021-12-28 07:00:00 Mild snowfall 270.714928 -2.435072 264.198556 267.913728 0.076540 2.405941 16.219425 206.407704 ... Southwest 1.313 53.587745 26.111 0.0 27.455241 5 0 5.0 Mild snowfall
    56608 2021-12-28 08:00:00 Mild snowfall 270.801398 -2.348602 264.705794 268.001076 0.086470 2.420409 16.219425 201.798103 ... South 1.113 53.587745 17.846 0.0 26.719272 5 0 5.0 Mild snowfall

    11109 rows × 22 columns

    In [92]:
    knn_model = KNeighborsClassifier(
        n_neighbors=5, 
        weights='uniform', 
        algorithm='auto', 
        leaf_size=30, 
        p=2, 
        metric='minkowski', 
        metric_params=None, 
        n_jobs=None)
    
    svm_model = svm.SVC(
        C=1.0, 
        kernel='rbf', 
        degree=3, 
        gamma='scale', 
        coef0=0.0, 
        tol=0.001, 
        cache_size=800, 
        class_weight='balanced', 
        random_state=42)
    
    dtc_model = DecisionTreeClassifier(
        criterion="entropy", 
        class_weight="balanced", 
        max_depth=10, 
        min_samples_leaf=3, 
        max_features=None, 
        random_state=42)
    
    gbc_model = GradientBoostingClassifier(
        loss='log_loss', 
        learning_rate=0.1, 
        n_estimators=100, 
        subsample=1.0, 
        criterion='friedman_mse', 
        min_samples_split=2, 
        min_samples_leaf=3, 
        min_weight_fraction_leaf=0.0, 
        max_depth=10, 
        )
    

    5.4.2.2 Multiclass Classification - Training and Validation

    In [93]:
    # Visual inspection of the class distribution
    wep_counts = df_extreme_weather['wep'].value_counts()
    print(wep_counts)
    sns.countplot(x = 'wep', data=df_extreme_weather, palette = "Set1")
    
    wep
    Mild snowfall                                          3892
    Moderate snowfall                                      2483
    Moderate rain                                          2322
    Heavy snowfall with accumulated snow                    926
    Frondurchlauf / Continuous freezing rain                875
    Storm with freezing rain / Heavy snow- and icestorm     401
    Snowstorm with high precipitation                       210
    Name: count, dtype: int64
    
    Out[93]:
    <Axes: xlabel='wep', ylabel='count'>

    Mapping labels and Events:

    label2 wep
    0.0 Moderate snowfall
    1.0 Heavy snowfall with accumulated snow
    2.0 Storm with freezing rain / Heavy snow- and ice pellets
    3.0 Moderate rain
    4.0 Snowstorm with high precipitation
    5.0 Mild snowfall
    6.0 Frondurchlauf / Continuous freezing rain
    In [94]:
    # Defining the dependent and independent variables
    X = df_extreme_weather.drop(columns=["label0", "label1", "label2", "wep", "avg_winddir", "wind_direction_label", "avg_temp_celsius", "run_datetime","display_label"])# Features
    y = df_extreme_weather['label2'] # Labels
    
    In [95]:
    # Split into train and test
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    print('X_train count: ', X_train.shape[0])
    print('X_test count: ', X_test.shape[0])
    print('Y_train count: ', Y_train.shape[0])
    print('Y_test count: ', Y_test.shape[0])
    
    X_train count:  7776
    X_test count:  3333
    Y_train count:  7776
    Y_test count:  3333
    

    Excursus on Oversampling in Data Imbalance At this point applying oversampling techniques like SMOTE were planned to be used to handle the imbalanced dataset. It is a technique addressing imbalanced datasets, involves artificially boosting the minority class during model training. While effective, its application should be approached judiciously. During iterative prototyping it was discovered that the initial results without oversampling, using the imbalanced data, are satisfactory and the model shows proficiency in handling imbalances without oversampling. Introducing oversampling at this point whatsoever may risk overfitting and complicate the model without significant gains. Decisions on oversampling is hinge on a nuanced understanding of the dataset, continual model evaluation, and the potential trade-offs introduced by altering the class distribution. It's a delicate balance between rectifying imbalance and preserving the integrity of the underlying data patterns. For these reasons, ovesampling techniques were considered here, but the code was deleted after all, since the evaluation results are satisfactory and overfitting on the data and complicating the model is avoided 1.


    1. Mohammed, R., Rawashdeh, J. and Abdullah, M., 2020, April. Machine learning with oversampling and undersampling techniques: overview study and experimental results. In 2020 11th international conference on information and communication systems (ICICS) (pp. 243-248). IEEE.↩

    5.4.2.3 Multiclass Classification - Fit Model

    In [96]:
    knn_model_fit = knn_model.fit(X_train, Y_train)
    svm_model_fit = svm_model.fit(X_train, Y_train)
    dtc_model_fit = dtc_model.fit(X_train, Y_train)
    gbc_model_fit = gbc_model.fit(X_train, Y_train)
    

    5.4.2.4 Multiclass Classification - Evaluation on Test Set

    In [97]:
    # Predict using the trained KNN model on the test set
    Y_pred_knn = knn_model_fit.predict(X_test)
    
    # Generate the confusion matrix
    cm_knn = confusion_matrix(Y_test, Y_pred_knn)
    
    # Visualize the Confusion Matrix
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues')
    plt.title('K-Nearest Neighbors Confusion Matrix ')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
    plt.savefig('../output/Multiclass Classification KNN Confusion Matrix.png')
    
    # Display the classification report
    print("Classification Report:\n",classification_report(Y_test, Y_pred_knn))
    
                  precision    recall  f1-score   support
    
             0.0       0.88      0.88      0.88       741
             1.0       0.78      0.82      0.80       274
             2.0       0.88      0.78      0.82       120
             3.0       1.00      1.00      1.00       679
             4.0       0.97      0.85      0.90        66
             5.0       0.95      0.97      0.96      1190
             6.0       0.84      0.80      0.82       263
    
        accuracy                           0.92      3333
       macro avg       0.90      0.87      0.88      3333
    weighted avg       0.92      0.92      0.92      3333
    
    
    <Figure size 800x550 with 0 Axes>

    The multiclass classification with K-Nearest Neighbors generally yields good results, with the actual outcomes closely aligning with the predictions.

    The classification report provides a comprehensive overview of the model's performance across multiple classes. Precision, representing the accuracy of positive predictions, is generally high, ranging from 78% to 100%. Recall, which measures the ability of the model to capture all relevant instances, shows consistent performance, with values ranging from 78% to 100%. The F1-score, which considers both precision and recall, is strong across most classes, indicating a balanced trade-off between precision and recall. The macro average F1-score is 0.88, suggesting a good overall performance. The weighted average F1-score, considering class imbalance, is also 0.92, demonstrating the model's effectiveness in making accurate predictions across the entire dataset. The high accuracy of 92% further supports the model's reliability in classifying instances from diverse classes. Overall, the model exhibits robust performance across various metrics, showcasing its ability to handle multiclass classification tasks effectively.

    In [98]:
    # Predict using the trained SVM model on the test set
    Y_pred_svm = svm_model_fit.predict(X_test)
    
    # Generate the confusion matrix
    cm_svm = confusion_matrix(Y_test, Y_pred_svm)
    
    # Visualize the Confusion Matrix
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Blues')
    plt.title('Support Vector Machine Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
    plt.savefig('../output/Multiclass Classification SVM Confusion Matrix.png')
    
    # Display the classification report
    print("Classification Report:\n",classification_report(Y_test, Y_pred_svm))
    
                  precision    recall  f1-score   support
    
             0.0       0.81      0.61      0.69       741
             1.0       0.41      0.69      0.52       274
             2.0       0.78      0.55      0.64       120
             3.0       0.95      1.00      0.98       679
             4.0       0.70      0.94      0.80        66
             5.0       0.94      0.92      0.93      1190
             6.0       0.68      0.67      0.68       263
    
        accuracy                           0.82      3333
       macro avg       0.75      0.77      0.75      3333
    weighted avg       0.84      0.82      0.82      3333
    
    
    <Figure size 800x550 with 0 Axes>

    The Support Vector Machine (SVM) exhibits a higher rate of misclassifying events compared to the K-Nearest Neighbors (KNN) algorithm. Specifically, Class 0 experiences a frequent misprediction, often being classified as Class 1 by the SVM. This discrepancy suggests a higher tendency for the SVM to incorrectly assign instances from Class 0 to Class 1, potentially indicating challenges in distinguishing between these two classes. The performance disparity between SVM and KNN highlights the importance of considering the specific characteristics of the dataset and the nature of the classes when selecting an appropriate classification algorithm.
    The classification report reveals some variations in the model's performance across different classes compared to the previous report. For class 0.0, precision has slightly decreased to 0.81, indicating that the positive predictions for this class are now at a slightly lower accuracy. Recall for class 1.0 has improved to 0.69, signifying an enhancement in capturing more relevant instances, and the F1-score for this class has also increased to 0.52. Class 2.0 shows an improvement in precision (0.78), but a decrease in recall (0.55), resulting in a slightly lower F1-score of 0.64. Class 4.0 exhibits a notable increase in precision (0.70) and a slight decrease in recall (0.94), leading to a higher F1-score of 0.80. Overall, the macro-average precision and recall remain relatively consistent at 0.75 and 0.77, respectively, contributing to a macro-average F1-score of 0.75. The weighted average F1-score stands at 0.82, indicating an overall improvement in the model's ability to balance precision and recall across the dataset, with an accuracy of 82%.

    In [99]:
    from sklearn.tree import plot_tree
    
    # Make predictions using the Decision Tree Classifier (DTC) model on the test set
    Y_pred_dtc = dtc_model_fit.predict(X_test)
    
    # Calculate the confusion matrix
    cm_dtc = confusion_matrix(Y_test, Y_pred_dtc)
    
    # Visualize the Confusion Matrix using seaborn heatmap
    plt.figure(figsize=(10,7))
    sns.heatmap(cm_dtc, annot=True, fmt='d', cmap='Blues')  
    plt.title('Decision Tree Classifier confusion matrix ')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
    
    # Save the confusion matrix plot
    plt.savefig('../output/Multiclass Classification DTC Confussion Matrix.png')
    
    # Print the classification report
    print("Classification Report:\n",classification_report(Y_test, Y_pred_dtc))
    
                  precision    recall  f1-score   support
    
             0.0       0.94      0.92      0.93       741
             1.0       0.84      0.93      0.88       274
             2.0       0.89      0.91      0.90       120
             3.0       1.00      0.99      1.00       679
             4.0       0.95      0.94      0.95        66
             5.0       0.98      0.97      0.98      1190
             6.0       0.92      0.92      0.92       263
    
        accuracy                           0.95      3333
       macro avg       0.93      0.94      0.94      3333
    weighted avg       0.96      0.95      0.95      3333
    
    
    <Figure size 800x550 with 0 Axes>

    The Decision Tree Classifier predicts all weather events well. It exhibits commendable performance in classifying weather events across multiple metrics. It achieves high precision, recall, and F1-score for most weather event classes, indicating its effectiveness in making accurate predictions. The classifier demonstrates particularly strong performance for classes 3.0, 4.0, and 5.0, achieving perfect precision and recall values. The overall accuracy of 95% underscores the classifier's ability to correctly classify the majority of instances. Comparing this result to previous models, the Decision Tree Classifier performs well. The Decision Tree's interpretability and simplicity make it a valuable model, but its performance may be surpassed by more sophisticated models in certain scenarios.

    In [ ]:
    # Visualize the Decision Tree using plot_tree from sklearn
    plt.figure(figsize=(40,20))
    fig = plot_tree(dtc_model,
                    filled = True,
                    rounded = True,
                    fontsize = 14)
    
    In [100]:
    # Make predictions using the Gradient Boosting Classifier (GBC) model on the test set
    Y_pred_gbc = gbc_model_fit.predict(X_test)
    
    # Calculate the confusion matrix
    cm_gbc = confusion_matrix(Y_test, Y_pred_gbc)
    
    # Visualize the Confusion Matrix using seaborn heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm_gbc, annot=True, fmt='d', cmap='Blues')  
    plt.title('Gradient Boosting Classifier confusion matrix ')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
    
    # Save the confusion matrix plot
    plt.savefig('../output/Multiclass Classification GBC Confusion Matrix.png')
    
    # Print the classification report
    print("Classification Report:\n",classification_report(Y_test, Y_pred_gbc))
    
    # Visualize the Decision Tree using plot_tree from sklearn (assuming it's a GBC model)
    plt.figure(figsize=(16, 9))
    
                  precision    recall  f1-score   support
    
             0.0       0.97      0.97      0.97       741
             1.0       0.93      0.96      0.94       274
             2.0       0.96      0.92      0.94       120
             3.0       1.00      1.00      1.00       679
             4.0       0.98      0.94      0.96        66
             5.0       0.99      0.99      0.99      1190
             6.0       0.95      0.95      0.95       263
    
        accuracy                           0.98      3333
       macro avg       0.97      0.96      0.96      3333
    weighted avg       0.98      0.98      0.98      3333
    
    
    Out[100]:
    <Figure size 26500x11500 with 0 Axes>
    <Figure size 800x550 with 0 Axes>
    <Figure size 26500x11500 with 0 Axes>

    The Confusion Matrix for the Gradient Boosting Classifier shows very well performance as most labels are predicted correctly.

    The Classification Report demonstrates remarkable performance across multiple metrics. It achieves high precision, recall, and F1-score for most weather event classes. Notably, it maintains precision rates above 94% for all classes, indicating a low false positive rate. The recall values, reflecting the ability to correctly identify instances of each class, are consistently high, ranging from 92% to 100%. The overall accuracy of 98% further emphasizes the classifier's proficiency in correctly classifying instances. Compared to the preceding models, the Gradient Boosting Classifier stands out with superior accuracy and balanced performance across various weather event classes. Its ability to handle complex relationships within the data and make accurate predictions makes it a robust choice for this classification task.

    5.4.2.5 Multiclass Classification - Save Model

    In [101]:
    # Create a meaningful name with a timestamp
    model_name_knn = f"knn_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_svm = f"svm_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_dtc = f"dtc_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    model_name_gbc = f"gbc_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    
    # Specify the folder path
    folder_path = "../models/"
    
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name_knn, 'wb') as f:
        pickle.dump(knn_model_fit, f)
    with open(folder_path + model_name_svm, 'wb') as f:
        pickle.dump(svm_model_fit, f)
    with open(folder_path + model_name_dtc, 'wb') as f:
        pickle.dump(dtc_model_fit, f)
    with open(folder_path + model_name_gbc, 'wb') as f:
        pickle.dump(gbc_model_fit, f)
    

    The analysis of the classification reports provides valuable insights into the performance of different classifiers across multiple weather event labels. The Extra Trees, XGBoost, and Random Forest classifiers consistently demonstrate high precision, recall, and F1-score across various weather event categories, showcasing their effectiveness in accurately predicting events. The Support Vector Machine (SVM) tends to misclassify events more frequently, especially for class 0, often predicting it as class 1. On the other hand, the K-Nearest Neighbors (KNN) algorithm generally performs well, with actual results closely aligning with predictions. The Gradient Boosting Classifier outshines other models, exhibiting exceptional precision, recall, and F1-score for most classes and achieving an impressive overall accuracy of 98%. The Decision Tree Classifier also performs well, with high precision and recall values for most classes, resulting in an accuracy of 95%. Comparatively, the SVM faces challenges in achieving robust performance across all classes, while the KNN algorithm proves to be reliable. The Gradient Boosting and Decision Tree classifiers emerge as top performers, providing accurate predictions across a diverse range of weather event labels.

    Generally, the results for the multiclass classification analysis are excellent, proofing that extreme weather events can be predicted with a very high accuracy using mutliclass classification techniques.

    6 Conclusions¶

    Overview¶

    In this exploratory data analysis project, titled "Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset," an endeavor was made to understand and predict weather patterns in the Bancroft region of Ontario, Canada, utilizing the ERA5 dataset from 2015 to 2022. The project was motivated by the significant role that accurate weather prediction plays in various sectors and the unique climate characteristics of the region, influenced by the lake effect. The complex task of weather prediction was approached through regression and classification analyses, with the aim of modeling weather parameters and predicting weather events, including extreme conditions.

    Regression Analysis Findings¶

    The regression analyses in chapter 5.3.4. aimed to forecast the temperature with historical data. The research question "Is it possible to build an accurate regression model to predict temperature based on historical data?" can be answered with yes. The results indicated that while an exact daily weather forecast remains challenging, satisfactory levels of accuracy in predicting the temperature forecast for the general temperature trends over the year using linear regression models were achieved. Support Vector Regressor emerged as the most effective model in this context.

    However, predicting the temperature using the wind speed variable in a linear regression analysis (5.3.1.) or a paramerer mix of different variables in a multiple predictor linear regression analysis (5.3.2.) did not proof to be successfull. The research question "Is it possible to find a correlation or causation between the temperature and windspeed, windgust or winddirection using regression techniques?" can therefore be answered with no. Using the given data, linear regression models cannot be successfully, with a satisfactory accuracy, be used to predict the temperature with the wind variables or a mix of variables. The analyses proofed, that the cause of this is for one the mostly non-linear relationship of the parameters, which proofs to be a lot more complex. The correlation of the parameters also was not sufficiently high for a linear regression analysis. For these reasons it was decided that the further approach is to continue the analysis with logistic regression and classification techniques.

    Also, the SARIMAX model used for temperature and wind modeling showed a systematic bias, consistently overestimating temperatures. This highlighted the limitations of using SARIMAX for predicting temperature variations based on wind, prompting the need for further refinement and exploration of alternative modeling approaches.

    The last regression analysis was using logistic regression techniques to classify extreme weather and bluse sky day events in a binary classification. The research question was "Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or alternatively windspeed) ranges?". The approach to predict extreme weather events based solely on temperature using logistic regression's techniques (5.3.5) was insufficient, often misclassifying extreme events as normal. This underscored the necessity for more complex or multivariate approaches to accurately anticipate hazardous weather conditions. Instead of further optimizing the logistic regression approach (e.g., hyperparameter tuning or multiple predictor regression), it was decided that the more promising approach would be to identify further binary classifiers as presented in the classification analysis chapters.

    Classification Analysis Findings¶

    In binary classification the goal was to predict whether an observation was an extreme weather or blue sky day event (5.4.1.). The asked research question was "Is it possible to classify and predict extreme weather events such as storms?". It was identified that extreme weather events can indeed very accurately be separated from blue sky day events and both classes can be predicted with a very high accuracy, precision and recall. "ExtraTreesClassifier," "XGBClassifier," "RandomForestClassifier," and "LGBMClassifier" are the top-performing classifiers based on LazyClassifier's assessment. Each demonstrated high accuracy, with XGBoost slightly leading the pack. These models proved effective in categorizing and predicting weather events from the given data, providing valuable tools for future weather prediction endeavors. The results of this analysis could then be used in multiclass classification, to determine the specific type of extreme weather event.

    The multiclass classification further nuanced the understanding of various weather events (5.4.2.). The goal was to determine and classify the specific type of extreme weather event, answering the research question "Is it possible to categorize and predict different extreme weather events based on multivariate weather data?". The research question can be answered with yes, the prediction and categorization of various extreme weather events is possible with a very high accuracy, precision and recall. Gradient boosting emerged as a particularly potent method, achieving high precision, recall, and F1-scores across all classes. This success illustrates the potential of sophisticated classification algorithms in deciphering complex weather patterns and predicting diverse weather events. This knowlegde can then also be used by scientists for further research for governmental institutions, e.g., when it comes to taking countermeasures to prevent damage from certain extreme weather events and minimize the risks and dangers.

    Critical reflection and outlook¶

    This project's journey through regression and classification analyses of weather data has provided a deeper insight into the atmospheric dynamics of Bancroft, Ontario. While profound analyses and research in temperature trend prediction and weather event classification was made, the challenges and inaccuracies encountered remind of the complexities inherent in meteorological studies. Predicting the weather accurately remains a daunting task, demanding continual refinement of models and methods.

    Critically reflecting on the used methodology and procedure, it can be said, that after conducting the EDA, which showed, that wind and temperature variables do not correlate or show association, further regression analyses of these parameters could have been discontinued there. But since scientific literature discussed, that such an analysis proofs to be valuable and other researchers already made successfull aproaches with those techniques on this data, the approach was continued, to further evaluate patterns and characteristics of temperature and wind parameters and their correlations. Especially the mutliple predictor analysis promised to deliver interesting results and after all, the research question and hypotheses could succesfully be refuted, which in the end is a very valuable contribution to science.

    Furthermore, the original ERA5 dataset was reduced using PCA and feature forward selection / feature backward deletion to conduct analysis on feature relevance. Only a reduced dataset of parameters was used in the final analyses. In further research other variables and their association / correlation could be analyzed. For example, literature suggests that cloud cover has a strong influence on air temperature. Using the same methodology and implemenation as in this assignment, simply switching the parameters, additional research with valuable insights into meteorological data can be conducted, resulting in better results.

    Moreover, the generalizability of the data has to be scrutinized, as the analyses are based on regional data and meteorological data is always very biased towards regional effects. Conducting the same analysis on data for regions in e.g., African countries might result in completely different results.

    The unpredictability and irregularity of weather and temperature patterns is a crucial aspect that adds complexity to the analysis. Acknowledging the irregular nature of meteorological phenomena emphasizes the inherent challenges in making precise predictions, even with advanced analytical techniques.

    While the analyses presented valuable insights, the potential for further optimization, including hyperparameter tuning, remains. Exploring these avenues could enhance the performance of the models and provide more accurate predictions, especially in the context of fine-tuning parameters for machine learning algorithms.

    The exploration of weather patterns could be expanded to include a dedicated analysis of trends related to climate change. Understanding the long-term impacts on temperature and weather events could provide valuable insights into the broader environmental context.

    Acknowledging that certain factors, such as climate change and omitted variable bias, might not have been explicitly addressed in the analysis, highlights the potential sources of variance and errors. Future research could delve deeper into these factors to refine models and improve predictive accuracy. It's essential to recognize that certain aspects, possibly including external factors or variables not considered in the analysis, might influence the weather and temperature trends. Acknowledging what falls out of the scope of the current study adds a layer of humility to the findings and encourages future researchers to explore additional dimensions.

    In summary, while the project has contributed significantly to understanding and predicting weather patterns, there exist additional dimensions and challenges that warrant exploration. The complexities of meteorological studies, coupled with the acknowledgment of unpredictable weather dynamics, call for continual refinement, optimization, and consideration of broader environmental factors for a comprehensive understanding of weather phenomena. Our findings contribute to the broader discourse on weather prediction, emphasizing the need for multidimensional approaches and the potential of machine learning techniques. As climate variability continues to present profound challenges, the insights garnered here offer a stepping stone towards more accurate, reliable, and comprehensive weather forecasting methods. Moving forward, integrating more diverse datasets, refining models, and exploring new methodologies will be crucial in enhancing the predictive capabilities and understanding of weather patterns. This endeavor not only aids in better forecasting but also in strategic planning and preparedness for the diverse impacts of weather and climate change across sectors.

    In [102]:
    !pip freeze requirements.txt
    
    aiohttp==3.9.1
    aiosignal==1.3.1
    alembic==1.13.0
    altair==5.2.0
    annotated-types==0.6.0
    anyio==3.7.1
    asttokens==2.4.1
    attrs==23.1.0
    certifi==2023.11.17
    charset-normalizer==3.3.2
    click==8.1.7
    colorama==0.4.6
    colorlog==6.8.0
    comm==0.2.0
    contourpy==1.2.0
    cycler==0.12.1
    Cython==3.0.6
    dataclasses-json==0.6.3
    debugpy==1.8.0
    decorator==5.1.1
    distlib==0.3.7
    executing==2.0.1
    fastapi==0.104.1
    fastjsonschema==2.19.0
    filelock==3.13.1
    fonttools==4.45.1
    frozenlist==1.4.0
    greenlet==3.0.1
    idna==3.6
    imbalanced-learn==0.11.0
    imblearn==0.0
    ipykernel==6.26.0
    ipython==8.17.2
    jedi==0.19.1
    Jinja2==3.1.2
    joblib==1.3.2
    jsonpatch==1.33
    jsonpointer==2.4
    jsonschema==4.20.0
    jsonschema-specifications==2023.11.1
    jupyter_client==8.6.0
    jupyter_core==5.5.0
    kiwisolver==1.4.5
    langchain==0.0.346
    langchain-core==0.0.10
    langsmith==0.0.69
    lazypredict==0.2.12
    lightgbm==4.1.0
    Mako==1.3.0
    MarkupSafe==2.1.3
    marshmallow==3.20.1
    matplotlib==3.8.2
    matplotlib-inline==0.1.6
    multidict==6.0.4
    mypy-extensions==1.0.0
    nbformat==5.9.2
    nest-asyncio==1.5.8
    numpy==1.26.2
    opensearch-py==2.4.2
    optuna==3.4.0
    packaging==23.2
    pandas==2.1.3
    parso==0.8.3
    patsy==0.5.4
    Pillow==10.1.0
    platformdirs==4.0.0
    plotly==5.18.0
    pmdarima==2.0.4
    podman==4.8.0.post1
    podman-compose==1.0.6
    prompt-toolkit==3.0.40
    protobuf==4.25.1
    psutil==5.9.6
    pure-eval==0.2.2
    pyarrow==14.0.1
    pydantic==2.5.2
    pydantic_core==2.14.5
    Pygments==2.16.1
    pyparsing==3.1.1
    PyPDF2==3.0.1
    python-dateutil==2.8.2
    python-dotenv==1.0.0
    pytz==2023.3.post1
    pywin32==306
    pyxdg==0.28
    PyYAML==6.0.1
    pyzmq==25.1.1
    referencing==0.31.0
    requests==2.31.0
    rpds-py==0.13.1
    scikit-learn==1.3.2
    scipy==1.11.4
    seaborn==0.13.0
    setuptools==69.0.2
    six==1.16.0
    skforecast==0.11.0
    sniffio==1.3.0
    SQLAlchemy==2.0.23
    stack-data==0.6.3
    starlette==0.27.0
    statsmodels==0.14.0
    tenacity==8.2.3
    threadpoolctl==3.2.0
    toolz==0.12.0
    tornado==6.3.3
    tqdm==4.66.1
    traitlets==5.13.0
    typing-inspect==0.9.0
    typing_extensions==4.8.0
    tzdata==2023.3
    urllib3==2.1.0
    vegafusion==1.4.5
    vegafusion-python-embed==1.4.5
    virtualenv==20.25.0
    vl-convert-python==1.1.0
    wcwidth==0.2.10
    xgboost==2.0.2
    yarl==1.9.4
    yellowbrick==1.5